Preface


This book provides an account of the ideas underlying the design of 
experiments in education. The need for such a book is evident from the 
increasing attention now being given to educational research, and the 
consequent necessity to provide further training for teachers and social 
workers in the techniques of experimentation. A related consideration is 
that up to the present almost all the books on experimental design base 
their accounts on illustrations from the fields of industry, agriculture or 
medicine. It is hoped, therefore, that the present book, written largely,
though not exclusively, for the research worker in education, will serve a
real need. 

An attempt has been made to develop the subject without assuming 
very much prior knowledge of statistics. Thus, in chapter 1 the reader is 
introduced to such basic concepts as statistical error, randomization, 
hypothesis-testing and estimation. Many readers, however, might prefer 
to come to grips with this discussion only after having read a basic text on 
statistical methods, such as the writer's Statistical Methods in Education. 
The general plan has been to explain the purpose of a particular design, 
then the method of computation, and then the theoretical model on which 
it is based. A knowledge of elementary algebra, or at least of algebraic 
symbolism, is necessary if the exposition of the models is to be followed in 
full; yet it is hoped that even those averse to mathematical expression may 
follow the main themes of the book. 

The designs chosen are the ones most frequently applied in educational 
research. They include those of randomized groups, randomized blocks, 
the covariance design and the factorial design, with its modifications to 
allow for the nesting as well as the crossing of variables. All assume 
measurement on an ordinal or interval scale, and so permit an analysis of 
variance. Detailed attention has been given throughout the book to the 
calculation of sums of squares and degrees of freedom, since the student 
wishing to apply these procedures to his own data will need firm assurance
here. References for further study are given at the end of the chapters.

I am grateful to Professor F. W. Warburton of the University of Man- 
chester for reading the manuscript and for making many helpful sug- 
gestions. Thanks are also due to the publisher's readers for their comments. 
I am, of course, solely responsible for any defects that remain. Finally I 
must record my thanks to Miss Mary Flint of University of London 
Press Ltd for the efficiency and courtesy with which she has dealt with 
all matters leading to the publication of this book. 


D. G. Lewis 
June 1967 


"Statistical procedure and experimental design are only two aspects of the 
same whole, and that whole comprises all the logical requirements of the 
complete process of adding to natural knowledge by experimentation."


Ronald A. Fisher, The Design of Experiments


Chapter 1 Basic Concepts


1.1 The recognition of statistical error 


It is nowadays commonplace to recognize the contribution of statistics to 
experiment in education and the social sciences. Investigators, including 
those still learning the techniques of research by writing theses for higher 
degrees, are aware that their results must be assessed statistically. They 
realize that findings obtained from relatively small numbers of subjects, 
the sample, have to be generalized, if at all possible, to include the larger 
numbers with respect to which the sample may be considered a repre- 
sentative part. Statisticians are therefore consulted to aid in formulating 
the conclusions that may properly be drawn from the experiment. 

There is, however, a noticeably smaller degree of awareness that the 
help of statisticians could well be sought before the experiment is con- 
ducted, that statistical principles should in fact govern the actual planning 
of the experiment. A failure to realize this could well lead to disappoint- 
ment, in that the data might have been obtained in a way that diminished, 
if not virtually excluded, the possibility of a particular hypothesis being 
verified. Of course, such verification could never be absolute. A margin of 
'error' (statistical error) is necessarily present, however skilfully the
experiment is designed. But with an unplanned, or haphazardly planned, 
experiment this margin of error could well be, and probably would be, 
unnecessarily large. 

Statistical error in the social sciences may be viewed as a result of the 
perversity of human nature: human beings invariably differ from each 
other in all manner of ways. Experimental results will vary in different 
experiments, even when these experiments are repetitions of the same 
basic one and are all conducted in a similar manner but with different 
samples of subjects. They would also vary if the same subjects were
tested on a second occasion. Allowance for this, the fundamental fact of
individual differences, needs to be built in to every experimental design.

As an illustration, consider the two sets of eight scores shown in table
1.1. They are, we will suppose, the scores of eight arts and eight science 
students on a test of creativity, a test which probes the creative aspects of 
one's thinking, such as the ability to formulate an answer to an unexpected 
question rather than to select an answer from a number of given alter- 
natives. Considerable variation in the scores is apparent, though many 
psychologists might comment that as an illustration of the data they work 
with the scores in table 1.1 are remarkably uniform. The purpose of the 
investigation would be to compare the creativity of arts and science 
students. 

Table 1.1 Scores of eight arts and eight science students on a
creativity test


                Arts    Science
                 63       58
                 61       56
                 58       56
                 57       54
                 55       51
                 52       50
                 52       48
                 50       47
Total score     448      420
Mean score     56.0     52.5


Our interest, of course, is not in the scores themselves. Each of the eight 
arts students' scores is of interest only in so far as it represents the test 
performance of arts students generally; and the same is true for the eight 
science students' scores. Obviously, then, it is the mean scores of the two
groups which merit our attention in the first place. Further, the dispersion
of scores within each of the two groups is an instance of statistical
error.

An additional point is that statistical error necessarily affects the value 
of the mean scores themselves. If, in other words, the experiment were
repeated with different students, different mean scores would almost
certainly result. How, then, can we be sure that the present difference in 
mean score, and, in particular, the direction of the difference—which 
indicates that arts students generally are more creative in their thinking 
than science students—is reliable? To obtain an answer to this question, 
statisticians consider the experiment to be just one of a series of similar 
experiments, experiments identical in design and implementation, but with 
different subjects. With a long series of such experiments, the differences 
in mean score could be expected to cluster round some fixed value. This 
value—which would in fact be the mean of the series of differences in 
mean score—is termed the true difference. The problem then becomes one 
of generalization. Knowing only the one obtained difference, what can be 
inferred about the true difference? 
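The idea of a long series of replications can be imitated numerically. The sketch below is only an illustration. It assumes, purely for the sake of the example, that scores in each population are normally distributed, with hypothetical population means of 56.0 and 52.5 and a common standard deviation of 4.5; none of these population values is given in the text, and the function name is likewise invented here.

```python
import random
import statistics

def replicate_differences(mean_a, mean_b, sd, n, replications, seed=0):
    """Simulate repeated experiments and return each difference in sample means."""
    rng = random.Random(seed)
    differences = []
    for _ in range(replications):
        # A fresh random sample of n subjects from each (assumed normal) population.
        sample_a = [rng.gauss(mean_a, sd) for _ in range(n)]
        sample_b = [rng.gauss(mean_b, sd) for _ in range(n)]
        differences.append(statistics.mean(sample_a) - statistics.mean(sample_b))
    return differences

# Hypothetical population values, chosen only for illustration.
diffs = replicate_differences(56.0, 52.5, sd=4.5, n=8, replications=2000)

print("mean of the simulated differences:", round(statistics.mean(diffs), 2))
print("smallest and largest difference:", round(min(diffs), 2), round(max(diffs), 2))
```

The individual differences scatter widely, but their mean settles close to the assumed true difference of 3.5, which is the sense in which the obtained difference in any one experiment is only an indication of the true one.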


1.2 Hypothesis-testing


One approach to the problem of generalization consists of formulating,
and then attempting to disprove, a null hypothesis. A null hypothesis states
that there is no true difference between the measures being sampled, so
that any obtained difference is due to chance, i.e. to the fluctuations of
sampling.* It follows that, for the data of table 1.1, the difference in means
of 3.5 must be viewed as one of a series of possible differences (obtained
from repetitions, or replications as they are termed, of the same basic
experiment), the distribution of these differences being centred at zero.
Statistical techniques have been developed giving the frequency with which
differences at least as great as the obtained difference would then arise.
(The frequency with which the difference of 3.5 in table 1.1 would arise is
evaluated later, pp. 29-30.) If this frequency is sufficiently small (and the
interpretation of 'sufficiently' is left to the investigator), the null
hypothesis is rejected, and a true difference accepted.
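The book evaluates this frequency later by means of the t ratio. As a rough, independent illustration of what "the frequency with which differences at least as great would arise" means, the sketch below uses a different technique, an exhaustive randomization (permutation) test: under the null hypothesis the labels 'arts' and 'science' are arbitrary, so every way of dividing the sixteen scores of table 1.1 into two groups of eight is treated as equally likely. The function name is illustrative only, and the result need not coincide exactly with the t-based figure.

```python
from itertools import combinations

# Scores from table 1.1.
arts = [63, 61, 58, 57, 55, 52, 52, 50]
science = [58, 56, 56, 54, 51, 50, 48, 47]

def randomization_frequency(group_a, group_b):
    """Return the observed absolute difference in means, and the proportion of
    all regroupings (same group sizes) whose absolute difference in means is at
    least as large. This is a two-tailed randomization test."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    at_least_as_large = 0
    regroupings = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-9:  # small tolerance for float rounding
            at_least_as_large += 1
        regroupings += 1
    return observed, at_least_as_large / regroupings

observed, frequency = randomization_frequency(arts, science)
print(f"observed difference in means: {observed}")
print(f"frequency of differences at least as large: {frequency:.3f}")
```

Whether the printed frequency counts as "sufficiently small" remains, as the text says, a matter for the investigator.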

* More generally, a null hypothesis must be an exact hypothesis. Thus it
could state the true difference to be any fixed amount, though in practice this
amount is almost always taken as zero. On the other hand, the alternative to the
null hypothesis as stated above, namely that the difference is some (unspecified)
non-zero amount, is not eligible as a null hypothesis because it is not exact.

A common practice has been to accept a true difference only if the
frequency, expressed as a percentage of the total frequency, does not exceed
5 per cent, or alternatively 1 per cent. These criteria provide the 5-per-cent
and 1-per-cent levels of significance. They are, however, of only
conventional importance, and an investigator is free to adopt other levels if he
wishes. Two conflicting aims are involved. One is to reduce the likelihood
of a null hypothesis being rejected when it is true, an aim which is achieved
by lowering the frequency adopted as the level of significance. This, however,
renders more difficult the realization of the second aim, that of not
accepting a null hypothesis too often when it is false. A balance between
these two aims must be struck in the light of the practical considerations
involved. Thus, if the acceptance of a true difference was likely to involve
sweeping changes in educational practice, with the expenditure of a large sum
of money, the reduction in the likelihood of a true null hypothesis being
rejected becomes the more important consideration. A rejection of a null
hypothesis which happens to be true is termed an error of type 1, while an
acceptance of a null hypothesis which happens to be false is termed an error
of type 2.
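The conflict between the two aims can be seen in a small simulation. The sketch below is hedged in several respects. It borrows the t ratio that is only introduced in chapter 2. It takes the two-tailed critical values for 14 degrees of freedom (2.145 and 2.977) from standard tables rather than from the text, so it is valid only for two groups of eight. It assumes normally distributed scores with a hypothetical standard deviation of 4.3 and, for the case where the null hypothesis is false, a hypothetical true difference of 3.5.

```python
import math
import random
import statistics

# Two-tailed critical values of t for 14 degrees of freedom (standard tables).
# These apply only to two groups of eight subjects each.
CRITICAL_T = {"5 per cent": 2.145, "1 per cent": 2.977}

def t_ratio(a, b):
    """Difference in means divided by its estimated standard error."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_variance = ss / (len(a) + len(b) - 2)
    se = math.sqrt(pooled_variance * (1 / len(a) + 1 / len(b)))
    return (mean_a - mean_b) / se

def rejection_rates(true_difference, sd=4.3, n=8, replications=4000, seed=0):
    """Proportion of simulated experiments rejecting the null hypothesis."""
    rng = random.Random(seed)
    rejections = {level: 0 for level in CRITICAL_T}
    for _ in range(replications):
        a = [rng.gauss(true_difference, sd) for _ in range(n)]
        b = [rng.gauss(0.0, sd) for _ in range(n)]
        t = abs(t_ratio(a, b))
        for level, critical in CRITICAL_T.items():
            if t > critical:
                rejections[level] += 1
    return {level: count / replications for level, count in rejections.items()}

# Null hypothesis true: every rejection is an error of type 1.
type_1 = rejection_rates(true_difference=0.0)
# Null hypothesis false (hypothetical true difference of 3.5):
# every failure to reject is an error of type 2.
rejected_when_false = rejection_rates(true_difference=3.5)
type_2 = {level: 1 - rate for level, rate in rejected_when_false.items()}

for level in CRITICAL_T:
    print(f"{level} level: type 1 rate {type_1[level]:.3f}, "
          f"type 2 rate {type_2[level]:.3f}")
```

Moving from the 5-per-cent to the 1-per-cent level can only lower the type 1 rate and raise the type 2 rate in such a simulation, since the same simulated experiments are judged against a stricter criterion.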

We need to consider, too, the alternative which would be accepted if the
null hypothesis were rejected. In the present illustration, the alternative is
clearly that some difference (direction unspecified) exists between the
creativity of thinking of arts and science students; the level of significance
would therefore have to be decided by a two-tailed test. If, for example, the
obtained difference of 3.5 in table 1.1 were found to be significant at the
10-per-cent level, it would mean that, were the null hypothesis true,
differences of 3.5 or more in favour of the arts students would be found in
replications of the experiment 5 per cent of the time, and similar differences
in favour of the science students a further 5 per cent of the time.
Occasionally, however, differences in one direction only are expected; or, at
any rate, any differences found in the unexpected direction would necessarily
be due to chance. If, for instance, the same students sat the same, or a
parallel, test on a second occasion (the purpose of the experiment being to
gauge practice effect), a true difference could only be such as to increase
the score on the second occasion. In such a case, the alternative to the null
hypothesis would be some difference in one (specified) direction, and a
one-tailed test of significance would be appropriate.


1.3 Estimation


The main benefit from tests of significance is the caution they induce.
Investigators might otherwise be tempted into extravagant claims. At the
same time the practical usefulness of these tests is limited. Often, too, they
cannot but seem artificial, in that for many experiments some true
difference is almost inevitable. The more important question would be
whether the difference is great enough to justify action. An estimate of the
true difference is usually a more worth-while goal.

The procedure is one which results in two limits within which the true
difference will probably lie, the degree of probability being precisely
determined. The degree of probability in fact is decided on first. For
instance, with an 80-per-cent degree of probability the calculation would
result in limits which have an 80-per-cent chance (four chances out of five)
of including the true difference between them. For the data of table 1.1
such limits would be 0.58 and 6.42 in favour of the arts students. If an
80-per-cent probability is considered too low, a higher probability could
be decided on, but naturally at the expense of widening the limits. With
a 95-per-cent probability the limits, from the data of table 1.1, become
-1.15 and 8.15, i.e. 1.15 in favour of the science students and 8.15 in
favour of the arts students; and with a 99-per-cent probability the limits
are -2.96 and 9.96. The limits are called confidence limits, those for the
95-per-cent and 99-per-cent probabilities being the ones most often used.
The result is always an assertion that the true difference lies between the
calculated limits, together with a statement of the probability that this
assertion is correct.
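The limits quoted above can be checked with a short calculation. The sketch below is one way of doing so, not necessarily the author's working: it uses the estimated standard error of the difference described in chapter 2, together with two-tailed critical values of t for 14 degrees of freedom (1.345, 2.145 and 2.977) taken from standard tables rather than from the text. The function and variable names are illustrative.

```python
import math

# Scores from table 1.1.
arts = [63, 61, 58, 57, 55, 52, 52, 50]
science = [58, 56, 56, 54, 51, 50, 48, 47]

# Two-tailed critical values of t for 14 degrees of freedom (standard tables),
# keyed by the degree of probability in per cent.
T_CRITICAL = {80: 1.345, 95: 2.145, 99: 2.977}

def confidence_limits(a, b, t_critical):
    """Lower and upper confidence limits for the true difference in means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_variance = ss / (len(a) + len(b) - 2)
    se = math.sqrt(pooled_variance * (1 / len(a) + 1 / len(b)))
    difference = mean_a - mean_b
    return difference - t_critical * se, difference + t_critical * se

for probability, t_critical in T_CRITICAL.items():
    lower, upper = confidence_limits(arts, science, t_critical)
    print(f"{probability}-per-cent limits: {lower:.2f} and {upper:.2f}")
```

The printed limits agree with those in the text to within a unit in the last decimal place; any such discrepancy appears to come from the text working with the standard error rounded to 2.17.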

The only practical outcome of the determination of confidence limits
for the data of table 1.1 is that the data are indecisive. No firm decision
that arts students generally have a greater creativity of thinking can be
made, though there is some indication that this may be so.* The numbers
of students (eight in each group) are too small, a conclusion that might
have been expected by anyone familiar with data of this type. If, on the
other hand, the 95-per-cent confidence limits had been, say, 2.00 and [...],
the decision as to the greater creativity of thinking in arts students could
reasonably be made. We could still, however, question the psychological,
as distinct from the statistical, significance of the result. A difference even
of 4.5 points of score might be too small on which to base
recommendations for a change in the teaching of science students, or a
change in their syllabuses of study for 'minority time'.


* With widely separated limits roughly centred about zero, on the other hand,
no indication of a true difference is provided, not even that the two populations
differ in the attribute tested.


1.4 Randomization


In our illustration of statistical error, test scores from the two groups of
students were compared. We did not, however, define precisely what we 
were discussing. We really need to know who are the arts and science 
students whose creativity scores we are comparing. Are they, for instance, 
university graduates (and, if so, with honours degrees, or pass degrees, or 
either), undergraduates, or grammar-school sixth-formers; are they male 
or female, or both; are they selected from the whole country or just from 
one area (e.g. the Home Counties), or from one Local Education Authority, 
or even from one school? Obviously we need to define the population of 
students—or rather the two populations of students (one arts, one science) 
—we wish to compare. 

To fix our ideas, we will suppose that the two populations in which we 
are interested are those of grammar-school sixth-formers specializing in 
arts and science, and that we are confining our attention, initially at any 
rate, to one school. It is imperative, then, that we select random samples of 
arts and science sixth-formers from that school. (The test scores set out in 
table 1.1, in fact—if the previous discussion is not to be invalidated from 
the outset—must be the scores obtained by such samples.) A random 
sample is one in which every member of the population has an equal chance 
of being selected. This implies that no choice is left to the investigator: he 
cannot decide that any particular member must be included (or excluded). 
It also implies that the selection of any particular member in no way 
influences the chances of selection of another. The population, in fact, can 
best be defined as the aggregate or totality in which every member has an 
equal, and a known, chance of being selected in the sample. Generally the 
most satisfactory way of ensuring a random sample is to use tables of 
random numbers, such as those of Kendall and Babington Smith (1939), 
Fisher and Yates (1963), or Snedecor (1956). 
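Where a computer is to hand, a pseudo-random number generator can stand in for printed tables of random numbers. The sketch below is an assumption-laden illustration, not a procedure from the text: the roll of forty arts sixth-formers and their labels are hypothetical, and Python's standard `random.sample` is used to draw without replacement, so that every member has the same chance of selection and the investigator exercises no choice.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct members at random, each equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, size)

# Hypothetical numbered roll of 40 arts sixth-formers in one school.
arts_roll = [f"arts-{number:02d}" for number in range(1, 41)]

sample = simple_random_sample(arts_roll, 8, seed=1967)
print(sample)
```

Numbering the members first, as the list `arts_roll` does here, corresponds to the preliminary step needed before random number tables can be used.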

It may often be impractical to sample by direct randomization. This is 
usually so when the population is very large, and when it would therefore 
be very time-consuming to number each member, a necessary preliminary 
to the use of random number tables. In this case a stratified random sample 
is selected. The population is initially separated into strata on the basis of 
some clearly defined feature or ‘control’, and a random sample is selected 
within each stratum. The size of each of these samples, too, could be made 
proportional to the size of the stratum. The control determining the strata
is usually such that a considerable variation could be expected between the
strata, the population being more homogeneous (with respect to the 
characteristic under investigation) within each of the strata. For an 
investigation on the creativity of thinking of sixth-formers, for example, 
the size of the sixth form, with its consequent influence on the educational 
facilities provided, might be considered an effective control. A stratified 
sample of sixth-formers could then be selected on the basis of sixth-form 
size, categories of pupils in, say, large, medium and small sixth forms being
formed, and random subsamples then being selected within each category
of size. A further point is that it may prove unduly troublesome to select
random samples of pupils, as the sample would be spread out over a large
number of schools (with possibly only one or two pupils in several of the
schools). A more practical procedure would then be to select in the first
place random samples of schools.
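A stratified random sample with subsamples proportional to stratum size might be drawn along the following lines. This is a minimal sketch under stated assumptions: the three strata (large, medium and small sixth forms), their sizes and the pupil labels are all hypothetical, and the function name is invented for the illustration.

```python
import random

def stratified_sample(strata, total_size, seed=None):
    """Draw a random subsample within each stratum, sized in proportion
    to the stratum's share of the whole population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding means the subsample sizes may not always add up
        # exactly to total_size when the proportions are awkward.
        subsample_size = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, subsample_size)
    return sample

# Hypothetical population of 500 sixth-formers, grouped by sixth-form size.
strata = {
    "large": [f"L{i}" for i in range(300)],
    "medium": [f"M{i}" for i in range(150)],
    "small": [f"S{i}" for i in range(50)],
}

sample = stratified_sample(strata, total_size=50, seed=1)
for name, members in sample.items():
    print(name, len(members))
```

Selection remains random within each stratum; only the number taken from each stratum is fixed in advance.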

Despite these and other possible complications, randomization at
some stage is an essential ingredient in every experimental design. One
reason is that it usually produces a reasonably representative sample, and,
in particular, it avoids bias. In our arts-science illustration, for instance, it
would obviously be wrong to select only the most able arts students if a
similar restriction were not made for those studying science. Exact
equalization is rarely achieved, however. In any particular sample pair
the arts students might still be, on average, more able. This, however,
would be due to chance alone, and would be allowed for in tests of
significance and in the estimations of a true difference. This brings us to the
second reason for randomness, namely that it permits the element of
chance to operate in a way that can be rigorously assessed. These, and
other related considerations, are developed more fully in an account by
Cox (1958).

Randomness, as we have seen, may be achieved in more than one way. 
Suppose, to develop the present illustration one stage further, that a sex 
difference in creativity of thinking has been suggested from previous 
research. It would then be sensible to select separately random groups of 
boys and girls (for example, four boys and four girls for each of the arts 
and science groups) rather than random groups of eight for boys and girls 
combined. It is true, of course, that selecting random groups of eight in 
this way could by chance produce groups of four boys and four girls. But if 
equal numbers of boys and girls were deliberately selected, the true
difference between the arts and science students would have to be estimated
in a different way. (The procedure would be that of a randomized-blocks
design described in chapter 5, not that of a randomized-groups design
described in chapter 3.) A valid determination of confidence limits (for the
arts-science difference) is possible in both designs, and when the sexes are
selected separately, confidence limits for the difference between sexes could
also be obtained. This brings out the point that the design of an experiment
determines not only what kinds of conclusions are possible, but also the
mode of analysis necessary to arrive at them. Considerations of design are
basic to all else.


1.5 The key role of design 


Statistical error in the field of education is usually very considerable, as
human material is so intrinsically variable, and, in fact, many
investigations would be doomed to inconclusiveness unless very large numbers of
subjects were used (a possibility often precluded by the limitations of time,
labour and financial support), or else the investigation were designed in
such a way as to reduce the error to manageable proportions. Even if a
superabundance of resources were available, it would still, of course, be
wasteful to proceed with an inefficient design. This key role of design, that
of reducing statistical error and so rendering the experiment more sensitive,
may be seen in an illustration adapted from one first formulated by
Fisher in his The Design of Experiments (1951), and which has since become
classical.


A person claims to be able to detect [...]
tea according to whether the tea or the [...]

The possible results, together with their probability of occurrence on the
basis of the null hypothesis, can then be readily specified.

Suppose that the complete experiment consists of five trials. For each
trial the probability of the correct cup being chosen is 1/4. With five trials
the possible results, with their probabilities, will be as follows.*


None correct  243/1024      3 correct  90/1024
1 correct     405/1024      4 correct  15/1024
2 correct     270/1024      5 correct   1/1024


If, therefore, the person achieves success in all five trials, the evidence
against the null hypothesis is very strong, since such a result would be
expected from chance only once in 1024 similar experiments, a probability
of 1/1024, or less than 0.1 per cent. If, however, the person makes one
mistake, and succeeds in four of the five trials, the probability of this,
together with that of the one better result, occurring by chance is
16/1024 = 1.56 per cent; so the null hypothesis could no longer be rejected,
and the result accepted as significant, at the 1-per-cent level. Again, if the
person makes two mistakes, choosing the correct cup in only three of the
trials, the probability of this, together with that of the two better results,
is 106/1024 = 10.35 per cent. The null hypothesis would not then be rejected
even at the 10-per-cent level.
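The probabilities used in this argument follow from the binomial expansion with a chance probability of 1/4 and five trials, and can be reproduced exactly with rational arithmetic. The sketch below is offered as a check on the arithmetic; the function names are illustrative and not taken from the text.

```python
from fractions import Fraction
from math import comb

def binomial_probabilities(trials, p):
    """Probability of 0, 1, ..., `trials` correct results: the successive
    terms of the binomial expansion (q + p)^trials, with q = 1 - p."""
    q = 1 - p
    return [comb(trials, k) * p**k * q**(trials - k) for k in range(trials + 1)]

def probability_at_least(trials, p, correct):
    """Probability of `correct` or more correct results arising by chance."""
    return sum(binomial_probabilities(trials, p)[correct:])

p_chance = Fraction(1, 4)  # one correct cup among four

for k, probability in enumerate(binomial_probabilities(5, p_chance)):
    # Multiply by 1024 so that each term is shown over the common denominator.
    print(f"{k} correct: {probability * 1024}/1024")

for correct in (5, 4, 3):
    tail = probability_at_least(5, p_chance, correct)
    print(f"{correct} or more correct: {float(tail):.2%}")
```

Using `Fraction` rather than floating-point numbers keeps every term exact, so that the totals of 16/1024 and 106/1024 quoted in the text can be confirmed without rounding.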

The possibility of mistakes should be considered, since the person
might not claim infallibility: if he were to claim this, a single mistake
would, of course, be decisive. His claim might be that he can correctly
detect the differently prepared cup of tea more often than could be
attributed to chance. He might even quantify his claim by saying that he
could, on average, detect the 'odd' cup once in every two trials, i.e. a
probability of 1/2 as against a chance probability of only 1/4. On this
basis, therefore, the probabilities of the possible results will be as
follows:†


None correct   1/32      3 correct  10/32
1 correct      5/32      4 correct   5/32
2 correct     10/32      5 correct   1/32


* The probabilities may be deduced from the binomial expansion. If p is the
probability of a single correct result, and q = (1 - p) is the probability of this
result not being obtained, then with a set of n trials the probability of 0, 1, 2, ...
correct results being obtained is given by the successive terms of the binomial
expansion (q + p)^n. In the present experiment p = 1/4, q = 3/4 and n = 5.

† With the notation of the footnote above, p = 1/2, q = 1/2 and n = 5.


If we had decided, when testing the null hypothesis, to adopt a 5-per-cent 
level of significance—so that either four or five correct results would be 


cent of the time. Our conclusion is that the experiment as at present 


were designed so that even if more than one mistake were made (i.e. fewer
than four correct results being obtained), the null hypothesis could be
rejected at, say, the 5-per-cent level.

In general, the sensitiveness of an experiment can be increased in
three ways. These are:

1. Increasing the size of the experiment.

2. Refining the experimental techniques.

3. Altering the experiment's internal structure, i.e. improving the
design.

The size of the present experiment would be increased simply by
adding on more trials. Thus, the reader may care to verify that with six
trials instead of five, the probability of one or no mistakes would fall to [...]

[...] additional precaution (an ingenious one possibly, involving extra labour
and expense) can always be proposed. There comes a point when it is more
sensible to proceed. What is important, however, is that an extraneous
influence which has not been controlled (i.e. equalized among the
experimental material) should, if possible, be randomized. (Thus, while it
would be easy to arrange for all the tea to be of the same kind, it would not
be so easy to equalize the strength of the brew. But if in any one trial the
four cups are poured from the same pot, the one with the tea poured in
before the milk should not be the first cup from the pot in every trial.
Rather the order of the cup chosen for the different treatment, the tea
poured in first, should be selected randomly, and separately, at each trial.)

The third way of increasing the sensitiveness of an experiment, that of
improving its design, may be seen from the fact that in the experiment
now under consideration we need not arrange for only one cup of tea out
of every four to have been treated differently from the rest. We could have
two cups of tea treated in this way, the person claiming to be able to detect
a difference in taste being asked to separate these two cups from the other
two. In any one trial the probability of the correct two cups being chosen
by chance is 1/6.* With five trials as before, the probabilities of the
different possible results appear as:


None correct  3125/7776      3 correct  250/7776
1 correct     3125/7776      4 correct   25/7776
2 correct     1250/7776      5 correct    1/7776


A dramatic drop in the probabilities is evident. The chance probability of
one or no mistakes is now 26/7776 = 0.33 per cent, so that with only one
mistake the overall result would be significant at the 1-per-cent level.
And with two mistakes the probability (of this and the two better results)
is only 276/7776 = 3.55 per cent, as against one of over 10 per cent before,
so giving an overall result significant at the 5-per-cent level. Clearly this
change in design is well worth while. Moreover, the benefit would accrue
whatever size of experiment (i.e. number of trials) was decided upon.


* The first cup chosen would be any of the four, and the second any one of
the remaining three, making 4 × 3 = 12 ways of choosing two cups in different
orders. As the same pair of cups could themselves be chosen in two orders, the
number of ways of choosing two cups without having regard to which is chosen
first is 1/2 of 12 = 6.

The changed design makes use of the available resources in a more effective
manner.
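The gain from the changed design can be set out side by side with the original one. The sketch below compares, for five trials, the chance probability of a given number of mistakes or fewer when one cup in four is treated differently (chance probability 1/4) and when two cups in four are (chance probability 1/6). The function name and the dictionary labels are illustrative only.

```python
from fractions import Fraction
from math import comb

def chance_of_at_most(mistakes, trials, p_correct):
    """Probability, under the null hypothesis, of `mistakes` or fewer
    wrong choices in `trials` independent trials."""
    q = 1 - p_correct
    return sum(
        comb(trials, wrong) * q**wrong * p_correct**(trials - wrong)
        for wrong in range(mistakes + 1)
    )

designs = {
    "one cup in four treated differently": Fraction(1, 4),
    # Two cups out of four can be chosen in comb(4, 2) = 6 ways.
    "two cups in four treated differently": Fraction(1, comb(4, 2)),
}

for name, p_correct in designs.items():
    for mistakes in (0, 1, 2):
        probability = chance_of_at_most(mistakes, 5, p_correct)
        print(f"{name}: {mistakes} or fewer mistakes, "
              f"{float(probability):.2%}")
```

The same function can be called with a different number of trials, which is one way of seeing that the benefit of the second design does not depend on the size of the experiment.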

The most effective use of the experimental material constitutes the key 
role of design. This is why an investigator—if he is not himself skilled in 
design—should consult a statistician before, rather than after, the experi- 
ment has taken place. Otherwise his experiment may not be sensitive 
enough to permit definite conclusions to be reached. Sometimes the nature 
of the problem may suggest that additional measures be taken, the extra 
trouble and expense being amply repaid by the increased information that 
will be forthcoming. Often more information can be obtained by grouping 
the subjects taking part, so that the groups will be more closely alike in 
certain respects. This grouping, too, may be carried out in more than one 
way; and possibly more than one set of measures may be taken from the
same groups. A description of these possibilities forms the subject matter
of this book.


Chapter 2 The t and F Ratios


2.1 The t test of significance


A question that frequently occurs in educational research is whether an 
obtained difference between the mean scores of two groups may be taken 
as sound evidence of a real, non-chance difference. In tackling this, the 
sampling variability of each of the two means has to be considered. It has 
been found that the best way of assessing this sampling variability is 
through the sum of the squares of the deviations of the scores in each 
group from their mean. 

Let us consider again the test scores of the groups of arts and science
students shown in table 1.1. The deviations of these scores from their own
group mean (from 56.0 for the scores of the arts students, and from 52.5
for those of the science students) are shown in table 2.1. The sum of
squares of the deviations for each group, i.e. $7.0^2 + 5.0^2 + \cdots + (-6.0)^2$
for the arts group, and $5.5^2 + 3.5^2 + \cdots + (-5.5)^2$ for the science
group, works out as 148 and 116 respectively. These provide, in relation to
the size of the group, an estimate of the sampling variability of the scores
of arts and science students (of the type sampled) generally. An unbiased
estimate of the population variance would, in fact, be provided for each
category of students by dividing the sum of squares by the group size minus 1.
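The sums of squares and the variance estimates can be verified directly from the scores of table 1.1. The following is a minimal sketch of that arithmetic; the function names are chosen for this illustration and do not come from the text.

```python
# Scores from table 1.1.
arts = [63, 61, 58, 57, 55, 52, 52, 50]
science = [58, 56, 56, 54, 51, 50, 48, 47]

def deviations(scores):
    """Deviation of each score from its own group mean."""
    mean = sum(scores) / len(scores)
    return [score - mean for score in scores]

def sum_of_squares(scores):
    """Sum of the squared deviations from the group mean."""
    return sum(d ** 2 for d in deviations(scores))

def variance_estimate(scores):
    """Unbiased estimate of the population variance:
    the sum of squares divided by the group size minus 1."""
    return sum_of_squares(scores) / (len(scores) - 1)

for name, scores in (("arts", arts), ("science", science)):
    print(name, deviations(scores))
    print(name, "sum of squares:", sum_of_squares(scores))
    print(name, "variance estimate:", round(variance_estimate(scores), 2))
```

The deviations printed for each group are those listed in table 2.1, and the sums of squares come to 148 and 116 as stated.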

The sampling variability of a mean score, on the other hand—i.e. the 
extent to which a mean score can be expected to vary as different samples 
of the same size are randomly selected from the same population—is most 
conveniently expressed by its standard error. The standard error of any 
sample measure, such as a mean, is the standard deviation of its sampling 
distribution, or, more fully, the standard deviation of the distribution of 
measures that would result if large numbers of different samples of the 
same size were randomly selected from the same population. With a 
sample of size n, the standard error of the mean, $\sigma_M$, is given by the formula

$$\sigma_M = \frac{\sigma}{\sqrt{n}} \tag{1}$$

$\sigma$ being the standard deviation of the distribution of individual scores in
the population. This formula assumes that the population is in
theory infinitely large, so that by comparison the size of the sample is
negligible.*

Similarly, with two mean scores (means from samples selected
randomly and independently from two populations) the standard error
of the difference between the means, $\sigma_{(M_1 - M_2)}$, is given by

$$\sigma_{(M_1 - M_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \tag{2}$$

$\sigma_1$ and $\sigma_2$ being the standard deviations of the scores in the two
populations, and $n_1$ and $n_2$ the two sample sizes.
The practical di[...]

[...] with the suffices 1 and 2 referring to the arts [...]
an unbiased estimate of $\sigma_1^2$ is provided by

$$\hat{\sigma}_1^2 = \frac{148.00}{7} = 21.14 \quad \text{(see table 2.1)}$$

and one of $\sigma_2^2$ by

$$\hat{\sigma}_2^2 = \frac{116.00}{7} = 16.57$$

[...] i.e. that $\sigma_1$ does not equal [...]
population variance by combining the [...]

$$\hat{\sigma}^2 = \frac{148.00 + 116.00}{7 + 7} = \frac{264.00}{14} = 18.86$$

This is the best estimate of the population variance (assumed to be the
same for both groups) in that it is based on all the available data. Generally
for samples of sizes $n_1$ and $n_2$ (not necessarily equal) the denominator for
obtaining $\hat{\sigma}^2$ would be $(n_1 - 1) + (n_2 - 1)$ or $n_1 + n_2 - 2$.


* If this were not the case, the formula would appear as
$\sigma_M = \frac{\sigma}{\sqrt{n}} \sqrt{1 - f}$, where $f$ is the sampling
fraction, i.e. the fraction of the population included in the sample. The
factor $\sqrt{1 - f}$ would also enter into the other standard error
formulae considered in this chapter.


Table 2.1 Analysis of the data in table 1.1 


Deviations of scores from their group mean 

                                   Arts       Science 
                                    7.0         5.5 
                                    5.0         3.5 
                                    2.0         3.5 
                                    1.0         1.5 
                                   -1.0        -1.5 
                                   -4.0        -2.5 
                                   -4.0        -4.5 
                                   -6.0        -5.5 

Sum of squares                    148.00      116.00 

Estimate of population       148.00/7 = 21.14    116.00/7 = 16.57 
variance 

Estimate from both samples   (148.00 + 116.00)/(7 + 7) = 264.00/14 = 18.86 

S.E. of difference           sqrt(18.86 (1/8 + 1/8)) = 2.17 
between means 

The standard error of the difference between two mean scores is now 
evaluated by replacing both $\sigma_1^2$ and $\sigma_2^2$ in formula (2) by the best 
estimate $\hat\sigma^2$. The formula then becomes 

\[ \hat\sigma_{(M_1-M_2)} = \sqrt{\hat\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{3} \]

The circumflex now appears over the symbol $\sigma_{(M_1-M_2)}$ since this formula, 
unlike formula (2), provides only an estimate of the standard error of the 
difference between the means. The important point, however, is that this 
can be evaluated entirely from the test scores themselves. For the 
arts-science data of table 1.1, the estimated standard error works out as 2.17, 
as shown at the bottom of table 2.1. 
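
The arithmetic of table 2.1 and formula (3) can be retraced in a few lines. This is a minimal sketch that takes the deviations listed in table 2.1 as its starting point; the variable and function names are illustrative only.

```python
import math

# Deviations of scores from their group mean, as listed in table 2.1.
arts_dev = [7.0, 5.0, 2.0, 1.0, -1.0, -4.0, -4.0, -6.0]
science_dev = [5.5, 3.5, 3.5, 1.5, -1.5, -2.5, -4.5, -5.5]


def sum_of_squares(deviations):
    """Sum of the squared deviations from the group mean."""
    return sum(d * d for d in deviations)


ss1, ss2 = sum_of_squares(arts_dev), sum_of_squares(science_dev)
n1, n2 = len(arts_dev), len(science_dev)

var1 = ss1 / (n1 - 1)                        # estimate for the arts group
var2 = ss2 / (n2 - 1)                        # estimate for the science group
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)     # estimate from both samples
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))   # formula (3)

print(round(var1, 2), round(var2, 2), round(pooled_var, 2), round(se_diff, 2))
```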

The ratio of an obtained difference in mean scores to the estimated 
standard error of this difference is known as the t ratio.* In symbols 

\[ t = \frac{M_1 - M_2}{\hat\sigma_{(M_1-M_2)}} \tag{4} \]

When the scores of each group are random samples from normally 
distributed populations—and as a consequence the sampling distribution 
of the difference between the means is also normal—the sampling 
distribution of the t ratios will follow the distribution defined by 

\[ y = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2} \tag{5} \]

[...] see that it depends not only on t but also on $\nu$, a symbol not as yet 
defined. $\nu$, however, is the [...] $n_1 + n_2 - 2$, the denominator used in 
obtaining $\hat\sigma^2$, or—as it is usually termed—the number of degrees of 
freedom on which $\hat\sigma^2$ is based.† There are therefore [...] 

* More generally the t ratio may be defined as the ratio of any statistic 
(mean, median, correlation, difference between two correlations, etc.) to an 
unbiased estimate of its standard error. 

† [...] population variance estimated [...] sample size. For a population 
variance [...] different samples, the number is the sum of the degrees [...] 
estimates for the separate samples. 
Statistical table 1* shows for different numbers of degrees of freedom 
the values of t necessary to enclose all except a given percentage of 
frequencies. Thus, for $\nu = 14$, the degrees of freedom on which the estimated 
population variance in table 2.1 is based, a value of 2.145 is necessary for 
significance at the 5-per-cent level (for a two-tailed test), since values 
greater than this (in either direction) will occur only 5 per cent of the 
time. The t ratio resulting from these data is 

\[ t = \frac{56.00 - 52.50}{2.17} = \frac{3.50}{2.17} = 1.61, \]

which is less than this. It is less even than the value necessary for 
significance at the 10-per-cent level (1.761). Obviously no reliable evidence 
for a greater creativity in the thinking of arts students is provided by the 
data of table 1.1. 

Figure 1 The t distribution for a small number of degrees of 
freedom, compared with the normal distribution (shown by a 
broken line). Horizontal axis: values of t. 

Incidentally, the confidence limits described in chapter 1 (p. 17) are 
determined from the t values for 14 degrees of freedom recorded in 
statistical table 1. Thus, those for an 80-per-cent degree of probability— 
i.e. the limits so calculated failing to include the true difference 20 per cent 
of the time—are obtained from the t value in the 20-per-cent column, 
1.345, as 3.50 ± 1.345 × 2.17 = 3.50 ± 2.92 = 0.58 and 6.42. 

* Note that the statistical tables are to be distinguished from the tables. The 
statistical tables are numbered with a single number consecutively through the 
book; the tables are numbered with a double number by chapter. 
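
The significance test and the 80-per-cent confidence limits can be retraced as follows. This is a sketch only: the group means, the estimated standard error and the three tabled values of t for 14 degrees of freedom are those quoted in the text, while the variable names are illustrative.

```python
mean_arts, mean_science = 56.00, 52.50    # group means quoted in the text
se_diff = 2.17                            # estimated standard error (table 2.1)
t_05, t_10, t_20 = 2.145, 1.761, 1.345    # tabled t for 14 degrees of freedom

diff = mean_arts - mean_science
t_ratio = diff / se_diff

significant_at_5 = abs(t_ratio) > t_05    # two-tailed test at the 5-per-cent level
significant_at_10 = abs(t_ratio) > t_10   # two-tailed test at the 10-per-cent level

# 80-per-cent confidence limits for the true difference between the means.
lower = diff - t_20 * se_diff
upper = diff + t_20 * se_diff

print(round(t_ratio, 2), significant_at_5, significant_at_10)
print(round(lower, 2), round(upper, 2))
```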


2.2 The computation of t 


Of the two groups of test scores set out in table 1.1, one has a whole- 
number mean and the other a mean with a simple fraction, so that the 
deviations from the mean and the sum of the squares of the deviations are 
obtained without trouble. Generally, however, a mean score does not 
work out to a convenient value, and squaring and summing the deviations 
might well prove laborious (unless one has access to a calculating machine). 
A method of calculating the sum of the squares of the deviations without 
actually obtaining the deviations themselves would then be desirable. 

The method consists of summing the scores, and also the squares of 
the scores, as they stand, and then correcting for the fact that it is the 
scores, and not their deviations from the mean, which have been summed. 
This can be illustrated from the group of arts scores from table 1.1 set out 
again below. The letter $x$ refers to any of the scores, and $x^2$ to the square of 
any of the scores. $\Sigma$ (sigma) is the symbol used to denote summing, so 
that for $\Sigma x$ we read ‘the sum of all the xs (i.e. the scores)’, and similarly 
for $\Sigma x^2$ we read ‘the sum of all the squares of the scores’. The sums 
$\Sigma x$ and $\Sigma x^2$ are obtained together with a correction term $\frac{(\Sigma x)^2}{n}$, 
$n$ being the number of scores in the list (in this case eight). The correction 
term is then subtracted from $\Sigma x^2$ to give what is termed the corrected 
sum of squares, a sum which we see to be identical with the sum of squared 
deviations obtained in table 2.1. 

      x        x² 
     63      3969 
     61      3721 
     58      3364 
     57      3249 
     55      3025 
     52      2704 
     52      2704 
     50      2500 

Σx = 448    Σx² = 25,236 

Correction term, (Σx)²/n = (448 × 448)/8 = 25,088 
Statistical table 1 Distribution of t* 

Degrees      Deviates necessary to enclose all except the given 
of           percentage of frequencies 
freedom      60%      40%      20%      10%       5%       2%       1% 

   1        0.727    1.376    3.078    6.314   12.706   31.821   63.657 
   2        0.617    1.061    1.886    2.920    4.303    6.965    9.925 
   3        0.584    0.978    1.638    2.353    3.182    4.541    5.841 
   4        0.569    0.941    1.533    2.132    2.776    3.747    4.604 
   5        0.559    0.920    1.476    2.015    2.571    3.365    4.032 
   6        0.553    0.906    1.440    1.943    2.447    3.143    3.707 
   7        0.549    0.896    1.415    1.895    2.365    2.998    3.499 
   8        0.546    0.889    1.397    1.860    2.306    2.896    3.355 
   9        0.543    0.883    1.383    1.833    2.262    2.821    3.250 
  10        0.542    0.879    1.372    1.812    2.228    2.764    3.169 
  11        0.540    0.876    1.363    1.796    2.201    2.718    3.106 
  12        0.539    0.873    1.356    1.782    2.179    2.681    3.055 
  13        0.538    0.870    1.350    1.771    2.160    2.650    3.012 
  14        0.537    0.868    1.345    1.761    2.145    2.624    2.977 
  15        0.536    0.866    1.341    1.753    2.131    2.602    2.947 
  16        0.535    0.865    1.337    1.746    2.120    2.583    2.921 
  17        0.534    0.863    1.333    1.740    2.110    2.567    2.898 
  18        0.534    0.862    1.330    1.734    2.101    2.552    2.878 
  19        0.533    0.861    1.328    1.729    2.093    2.539    2.861 
  20        0.533    0.860    1.325    1.725    2.086    2.528    2.845 
  21        0.532    0.859    1.323    1.721    2.080    2.518    2.831 
  22        0.532    0.858    1.321    1.717    2.074    2.508    2.819 
  23        0.532    0.858    1.319    1.714    2.069    2.500    2.807 
  24        0.531    0.857    1.318    1.711    2.064    2.492    2.797 
  25        0.531    0.856    1.316    1.708    2.060    2.485    2.787 
  26        0.531    0.856    1.315    1.706    2.056    2.479    2.779 
  27        0.531    0.855    1.314    1.703    2.052    2.473    2.771 
  28        0.530    0.855    1.313    1.701    2.048    2.467    2.763 
  29        0.530    0.854    1.311    1.699    2.045    2.462    2.756 
  30        0.530    0.854    1.310    1.697    2.042    2.457    2.750 
  40        0.529    0.851    1.303    1.684    2.021    2.423    2.704 
  60        0.527    0.848    1.296    1.671    2.000    2.390    2.660 
 120        0.526    0.845    1.289    1.658    1.980    2.358    2.617 
   ∞        0.524    0.842    1.282    1.645    1.960    2.326    2.576 

* Adapted from table 3 of Fisher, R. A. and Yates, F., Statistical Tables for 
Biological, Agricultural and Medical Research, Oliver & Boyd, 6th edition 1963, 
by permission of the authors and publishers. 
Corrected sum of squares = $\Sigma x^2 - \frac{(\Sigma x)^2}{n}$ 
                         = 25,236 - 25,088 
                         = 148* 


We see, then, that a computational formula for the best estimate of 
the population variance $\hat\sigma^2$ may be written as 

\[ \hat\sigma^2 = \frac{\Sigma x_1^2 - \frac{(\Sigma x_1)^2}{n_1} + \Sigma x_2^2 - \frac{(\Sigma x_2)^2}{n_2}}{n_1 + n_2 - 2} \tag{6} \]

$x_1$ denoting any score in sample 1, of size $n_1$, and $x_2$ any score in sample 2, 
of size $n_2$. This may be used in the formula for the estimated standard 
error of the difference between two means (formula 3) to give 

\[ \hat\sigma_{(M_1-M_2)} = \sqrt{\frac{\Sigma x_1^2 - \frac{(\Sigma x_1)^2}{n_1} + \Sigma x_2^2 - \frac{(\Sigma x_2)^2}{n_2}}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{7} \]

Finally, the formula for the t ratio (formula 4) may be rewritten in the 
computational form 

\[ t = \frac{M_1 - M_2}{\sqrt{\frac{\Sigma x_1^2 - \frac{(\Sigma x_1)^2}{n_1} + \Sigma x_2^2 - \frac{(\Sigma x_2)^2}{n_2}}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \tag{8} \]

* When, as in the present illustration, all [...]ion could be ma[...] 
convenient number for all the scores before st[...] to give 13, 11, 8, 7, 5, 2, 2 
and 0, [...] ($x_c = x - 50$) to denote any coded score, $\Sigma x_c = 48$, 
$\Sigma x_c^2 = 436$, but $436 - \frac{48^2}{8} = 148$ as before. 
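
The computational route to t may be sketched as below. The arts scores are those set out in the text. The science scores are not listed in this section and are an assumption here, reconstructed from the science mean of 52.50 and the deviations in table 2.1. The function names are illustrative.

```python
import math


def corrected_sum_of_squares(scores):
    """Sum of squared scores minus the correction term (sum of scores)^2 / n."""
    n = len(scores)
    total = sum(scores)
    return sum(x * x for x in scores) - total * total / n


def t_ratio(sample1, sample2):
    """t ratio for two independent groups, following formulae (6) to (8)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    pooled_var = (corrected_sum_of_squares(sample1)
                  + corrected_sum_of_squares(sample2)) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


arts = [63, 61, 58, 57, 55, 52, 52, 50]
# Assumption: reconstructed from the mean 52.50 and the deviations in table 2.1.
science = [58, 56, 56, 54, 51, 50, 48, 47]

css_arts = corrected_sum_of_squares(arts)
# Coding the scores by subtracting 50 leaves the corrected sum unchanged.
css_coded = corrected_sum_of_squares([x - 50 for x in arts])
t = t_ratio(arts, science)

print(css_arts, css_coded, round(t, 2))
```

Because no deviations are formed, the only rounding occurs in the final division, which is why this route was preferred for hand calculation.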
2.3 The assumptions underlying the t test 


It is as well to remind ourselves of the assumptions underlying the use of 
the t distribution in evaluating the statistical significance of the obtained 
difference between means. These assumptions may be stated as follows: 


1. The scores in each of the two populations from which the groups are 
randomly selected are normally distributed (the assumption of 
normality of distribution). 

2. The variances of the scores in the two populations are equal (the 
assumption of homogeneity of variance). 

3. The two groups are selected independently. 


There is usually no difficulty in being sure whether or not the last assump- 
tion holds. Most experiments are planned in such a way that the selection 
of one group in no way influences that of the other. Again, when related 
groups are incorporated into the design, a modification of the procedure 
outlined in section 2.1 is readily available (see, for example, Lewis 1967). 
(The individuals in the groups could be ordered into ‘blocks’ as described 
in the randomized-blocks design, chapter 5.) It is the first two assump- 
tions which now merit more detailed consideration. 

It is of little use testing the assumption of normality by the standard 
tests of skewness and kurtosis, since these tests are insensitive for small 
samples; i.e. the hypothesis of normality would not be rejected unless the 
skewness or kurtosis was very pronounced. Similarly, testing the assump- 
tion of homogeneity of variance by the F test (discussed in section 2.4) is 
also of limited value, this test too being insensitive. Of course, even if more 
sensitive tests were readily available, failure to detect a real difference 
would not mean that the assumption was necessarily true. A more 
promising line of investigation is that of demonstrating the effect of 
violating the assumptions from empirical studies of sampling from 
populations with known characteristics. A study by Boneau (1960) is of 
this kind. 

Large numbers of ts were calculated—by an electronic computer— 
from the difference between the means of samples drawn at random from 
populations which were (a) normal, (b) platykurtic (actually rectangular) 
and (c) strongly skewed (actually J-shaped with a skew to the right). The 
means of all the populations were the same, but some had variances four 
times as large as the others. For different combinations of populations, 
and for different sample sizes, the percentages of ts exceeding the 
5-per-cent and 1-per-cent levels were found. 

When both populations were normal but had different variances 
(one four times the other) Boneau found that with samples of 5, 6.4 per 
cent of the ts exceeded the theoretical 5-per-cent limits, and that with 
samples of 15 the corresponding figure was 4.9 per cent. With different 
sample sizes, however, (one of 5, the other of 15) only 1 per cent of the ts 
exceeded the 5-per-cent limits when the smaller samples were drawn from 
the population with the smaller variance, and 16 per cent of the ts 
exceeded these limits when the smaller samples were drawn from the 
population with the larger variance. A similar pattern was found for the 
percentages of ts exceeding the 1-per-cent limits (see table 2.2). Evidently 
with populations of unequal variance the discrepancies become important 
only when the sample sizes are unequal. 

When the two populations were strongly skewed (but identical), 
slightly too few of the calculated ts exceeded the theoretical limits, 3.1 per 
cent exceeding the 5-per-cent limits for samples of 5, and 4.0 per cent for 
samples of 15. This is a re[...] the calculated ts exceeding the 5[...] 

Finally, when sampling from two different[...] such as the normal and 
skewed populations, [...]less, it would obviously be foolhardy to apply 
one-tailed tests of significance to the means of small samples drawn from 
populations known to be differently skewed. 

[...], and if the two population distributions have the [...] 
unless the population variances are markedly unequal. It is the combina- 
tion of unequal sample sizes with (suspected) unequal population variances 
that must be guarded against. Procedures for dealing with this situation 
are outlined by Fisher and Yates (1963) and Cochran and Cox (1957). 


Table 2.2 Percentages of ts exceeding the theoretical 5-per-cent and 
1-per-cent limits in Boneau’s study* 

A. Sampling from distributions with the same shape 

                Variances equal (E)                     Obtained percentage at 
Distribution    or different (D)     Sample sizes†     5 per cent    1 per cent 
                (ratio 4:1) 

Normal               D               Both 5                6.4           1.8 
Normal               D               Both 15               4.9           1.1 
Normal               D               15 and 5              1.0           0.1 
Normal               D               5 and 15             16.0           6.0 
Skewed               E               Both 5                3.1           0.3 
Skewed               E               Both 15               4.0           0.4 
Platykurtic          E               Both 5                5.4           1.0 
Platykurtic          E               Both 15               5.0           1.5 
Platykurtic          D               Both 5                7.1           1.9 

B. Sampling from distributions with different shapes 
(equal variance) 

                                             Obtained percentage at 
                                         5 per cent          1 per cent 
Distribution              Sample      Total   Larger      Total   Larger 
                          sizes               tail                tail 

Skewed and normal         Both 5       7.1     5.6         1.9     .. 
Skewed and normal         Both 15      5.1     4.2         1.4     1.2 
Skewed and normal         Both 25      4.6     2.7         1.3     1.1 
Skewed and platykurtic    Both 5       6.4     5.0         3.3     .. 
Skewed and platykurtic    Both 15      5.6     3.9         1.6     .. 

(.. = entry not legible in the source) 

* Adapted from the table on p. 61 of Psychological Bulletin, 57, 1, 1960 by 
permission of the author and the Psychological Bulletin. 
† If variances are unequal, the size of the samples from the distribution with 
the larger variance is placed first. 
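
The pattern reported for unequal variances combined with unequal sample sizes can be reproduced, at least roughly, by a small sampling experiment. The sketch below is not Boneau's procedure: the number of trials, the seed and the function names are assumptions made for illustration. It draws a sample of 5 and a sample of 15 from normal populations with equal means and variances in the ratio 4:1, and counts how often the ordinary t ratio exceeds the tabled 5-per-cent value for 18 degrees of freedom (2.101 in statistical table 1).

```python
import math
import random


def t_ratio(a, b):
    """Pooled-variance t ratio for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def rejection_rate(n1, sd1, n2, sd2, critical, trials=10000, seed=0):
    """Proportion of t ratios beyond +/- critical when the true means are equal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sd1) for _ in range(n1)]
        b = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if abs(t_ratio(a, b)) > critical:
            hits += 1
    return hits / trials


T_05_DF18 = 2.101   # tabled 5-per-cent value of t for 18 degrees of freedom

# Standard deviations of 2 and 1 give variances in the ratio 4:1.
small_from_larger_var = rejection_rate(5, 2.0, 15, 1.0, T_05_DF18)
small_from_smaller_var = rejection_rate(5, 1.0, 15, 2.0, T_05_DF18)

print(small_from_larger_var, small_from_smaller_var)
```

The first proportion should come out well above the nominal 0.05 and the second well below it, which is the direction of the discrepancies shown in part A of table 2.2.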


2.4 The F test of significance 


The two groups of scores in table 1.1 could also be treated by the more 
general technique of the analysis of variance. It is more general in so far as 
it can be applied to any number of groups, not just to two. It consists of 
partitioning the total variation into two or more distinct sources. Thus, 
the variation of the scores in table 1.1 would be partitioned into two 
sources, that between groups and that within groups. 


Table 2.3 Alternative analysis of the data in table 1.1 

Source of            Sum of       Degrees of 
variation            squares      freedom        Mean square 

Between groups        49.00           1            49.00 
Within groups        264.00          14            18.86 
Total                313.00          15 
We see that the sums of the squared deviations—written more shortly 
as the sums of squares—for between groups and within groups have been 
added to give a total sum of squares. This, which comprises the total 
variation of the sixteen scores irrespective of grouping, could also have 
been obtained as the sum of the squares of the deviations of all the scores 
from the overall mean (54.25). It is, in fact, the corrected sum of squares of 
the sixteen scores (see section 2.2). The degrees of freedom (see footnote, 
p. 28), which are 1 for between groups (since there are only two groups) 
and 14 for within groups (i.e. 7 for each group of eight scores), are likewise 
added to give 15 for the total. This is clearly correct, since there are sixteen 
scores in all. The analysis thus resolves both the sum of squares and 
degrees of freedom into separate components. 

The mean squares are obtained by dividing each sum of squares by the 
appropriate degrees of freedom. That for within groups (18.86) has already 
been obtained (table 2.1) as an estimate of the population variance 
assumed to be common to each of the two groups. On the assumption that 
the means of the two populations also do not differ, the mean square for 
between groups (49.00) provides a second, and independent, estimate of 
this same variance, one based solely on the two obtained group means. 
If, on the other hand, the two populations though still having the same 
variance have different means, the mean square for between groups then 
estimates the common population variance plus a component resulting 
from the difference. Obviously, then, the crucial question is whether the 
divergence of the two mean squares is great enough to indicate a difference 
between the population means. 

The question is answered by taking the ratio of the two mean squares, 
known as the F ratio. (It was formerly known as the variance ratio, and 
was renamed F in honour of R. A. Fisher.) This is because when two 
estimates of a population variance have been independently obtained from 
the same population—or, what is essentially the same, from different 
populations with the same mean and the same variance—then, on condition 
that the distribution of scores in the population(s) is normal, the sampling 
distribution of the F ratios is known and is given by 

\[ y = \frac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \cdot \frac{F^{(\nu_1-2)/2}}{(\nu_1 F + \nu_2)^{(\nu_1+\nu_2)/2}} \tag{9} \]




Statistical table 2A Distribution of F: 5-per-cent points* 

Values of F that would be exceeded in 5 per cent of pairs of 
random samples of various size combinations 

Degrees of 
freedom for                 Degrees of freedom for larger variance 
smaller 
variance       1      2      3      4      5      6      8     12     24      ∞ 

    1       161.4  199.5  215.7  224.6  230.2  234.0  238.9  243.9  249.0  254.3 
    2       18.51  19.00  19.16  19.25  19.30  19.33  19.37  19.41  19.45  19.50 
    3       10.13   9.55   9.28   9.12   9.01   8.94   8.84   8.74   8.64   8.53 
    4        7.71   6.94   6.59   6.39   6.26   6.16   6.04   5.91   5.77   5.63 
    5        6.61   5.79   5.41   5.19   5.05   4.95   4.82   4.68   4.53   4.36 
    6        5.99   5.14   4.76   4.53   4.39   4.28   4.15   4.00   3.84   3.67 
    7        5.59   4.74   4.35   4.12   3.97   3.87   3.73   3.57   3.41   3.23 
    8        5.32   4.46   4.07   3.84   3.69   3.58   3.44   3.28   3.12   2.93 
    9        5.12   4.26   3.86   3.63   3.48   3.37   3.23   3.07   2.90   2.71 
   10        4.96   4.10   3.71   3.48   3.33   3.22   3.07   2.91   2.74   2.54 
   11        4.84   3.98   3.59   3.36   3.20   3.09   2.95   2.79   2.61   2.40 
   12        4.75   3.88   3.49   3.26   3.11   3.00   2.85   2.69   2.50   2.30 
   13        4.67   3.80   3.41   3.18   3.02   2.92   2.77   2.60   2.42   2.21 
   14        4.60   3.74   3.34   3.11   2.96   2.85   2.70   2.53   2.35   2.13 
   15        4.54   3.68   3.29   3.06   2.90   2.79   2.64   2.48   2.29   2.07 
   16        4.49   3.63   3.24   3.01   2.85   2.74   2.59   2.42   2.24   2.01 
   17        4.45   3.59   3.20   2.96   2.81   2.70   2.55   2.38   2.19   1.96 
   18        4.41   3.55   3.16   2.93   2.77   2.66   2.51   2.34   2.15   1.92 
   19        4.38   3.52   3.13   2.90   2.74   2.63   2.48   2.31   2.11   1.88 
   20        4.35   3.49   3.10   2.87   2.71   2.60   2.45   2.28   2.08   1.84 
   21        4.32   3.47   3.07   2.84   2.68   2.57   2.42   2.25   2.05   1.81 
   22        4.30   3.44   3.05   2.82   2.66   2.55   2.40   2.23   2.03   1.78 
   23        4.28   3.42   3.03   2.80   2.64   2.53   2.38   2.20   2.00   1.76 
   24        4.26   3.40   3.01   2.78   2.62   2.51   2.36   2.18   1.98   1.73 
   25        4.24   3.38   2.99   2.76   2.60   2.49   2.34   2.16   1.96   1.71 
   26        4.22   3.37   2.98   2.74   2.59   2.47   2.32   2.15   1.95   1.69 
   27        4.21   3.35   2.96   2.73   2.57   2.46   2.30   2.13   1.93   1.67 
   28        4.20   3.34   2.95   2.71   2.56   2.44   2.29   2.12   1.91   1.65 
   29        4.18   3.33   2.93   2.70   2.54   2.43   2.28   2.10   1.90   1.64 
   30        4.17   3.32   2.92   2.69   2.53   2.42   2.27   2.09   1.89   1.62 
   40        4.08   3.23   2.84   2.61   2.45   2.34   2.18   2.00   1.79   1.51 
   60        4.00   3.15   2.76   2.52   2.37   2.25   2.10   1.92   1.70   1.39 
  120        3.92   3.07   2.68   2.45   2.29   2.17   2.02   1.83   1.61   1.25 
    ∞        3.84   2.99   2.60   2.37   2.21   2.09   1.94   1.75   1.52   1.00 

* Taken from table 5 of Fisher, R. A. and Yates, F., Statistical Tables for 
Biological, Agricultural and Medical Research, Oliver & Boyd, 6th edition 1963, 
by permission of the authors and publishers. 




Statistical table 2B Distribution of F: 1-per-cent points* 

Values of F that would be exceeded in 1 per cent of pairs of random 
samples of various size combinations 

Degrees of 
freedom for                 Degrees of freedom for larger variance 
smaller 
variance       1      2      3      4      5      6      8     12     24      ∞ 

    1        4052   4999   5403   5625   5764   5859   5981   6106   6234   6366 
    2       98.49  99.00  99.17  99.25  99.30  99.33  99.36  99.42  99.46  99.50 
    3       34.12  30.81  29.46  28.71  28.24  27.91  27.49  27.05  26.60  26.12 
    4       21.20  18.00  16.69  15.98  15.52  15.21  14.80  14.37  13.93  13.46 
    5       16.26  13.27  12.06  11.39  10.97  10.67  10.29   9.89   9.47   9.02 
    6       13.74  10.92   9.78   9.15   8.75   8.47   8.10   7.72   7.31   6.88 
    7       12.25   9.55   8.45   7.85   7.46   7.19   6.84   6.47   6.07   5.65 
    8       11.26   8.65   7.59   7.01   6.63   6.37   6.03   5.67   5.28   4.86 
    9       10.56   8.02   6.99   6.42   6.06   5.80   5.47   5.11   4.73   4.31 
   10       10.04   7.56   6.55   5.99   5.64   5.39   5.06   4.71   4.33   3.91 
   11        9.65   7.20   6.22   5.67   5.32   5.07   4.74   4.40   4.02   3.60 
   12        9.33   6.93   5.95   5.41   5.06   4.82   4.50   4.16   3.78   3.36 
   13        9.07   6.70   5.74   5.20   4.86   4.62   4.30   3.96   3.59   3.16 
   14        8.86   6.51   5.56   5.03   4.69   4.46   4.14   3.80   3.43   3.00 
   15        8.68   6.36   5.42   4.89   4.56   4.32   4.00   3.67   3.29   2.87 
   16        8.53   6.23   5.29   4.77   4.44   4.20   3.89   3.55   3.18   2.75 
   17        8.40   6.11   5.18   4.67   4.34   4.10   3.79   3.45   3.08   2.65 
   18        8.28   6.01   5.09   4.58   4.25   4.01   3.71   3.37   3.00   2.57 
   19        8.18   5.93   5.01   4.50   4.17   3.94   3.63   3.30   2.92   2.49 
   20        8.10   5.85   4.94   4.43   4.10   3.87   3.56   3.23   2.86   2.42 
   21        8.02   5.78   4.87   4.37   4.04   3.81   3.51   3.17   2.80   2.36 
   22        7.94   5.72   4.82   4.31   3.99   3.76   3.45   3.12   2.75   2.31 
   23        7.88   5.66   4.76   4.26   3.94   3.71   3.41   3.07   2.70   2.26 
   24        7.82   5.61   4.72   4.22   3.90   3.67   3.36   3.03   2.66   2.21 
   25        7.77   5.57   4.68   4.18   3.86   3.63   3.32   2.99   2.62   2.17 
   26        7.72   5.53   4.64   4.14   3.82   3.59   3.29   2.96   2.58   2.13 
   27        7.68   5.49   4.60   4.11   3.78   3.56   3.26   2.93   2.55   2.10 
   28        7.64   5.45   4.57   4.07   3.75   3.53   3.23   2.90   2.52   2.06 
   29        7.60   5.42   4.54   4.04   3.73   3.50   3.20   2.87   2.49   2.03 
   30        7.56   5.39   4.51   4.02   3.70   3.47   3.17   2.84   2.47   2.01 
   40        7.31   5.18   4.31   3.83   3.51   3.29   2.99   2.66   2.29   1.80 
   60        7.08   4.98   4.13   3.65   3.34   3.12   2.82   2.50   2.12   1.60 
  120        6.85   4.79   3.95   3.48   3.17   2.96   2.66   2.34   1.95   1.38 
    ∞        6.64   4.60   3.78   3.32   3.02   2.80   2.51   2.18   1.79   1.00 

* Taken from table 5 of Fisher, R. A. and Yates, F., ibid., by permission of 
the authors and publishers. 
in which B denotes the beta function as defined in standard texts on the 
theory of functions.* This distribution is known as the F distribution. 
We see that it depends on $\nu_1$ and $\nu_2$ as well as on F. $\nu_1$ and $\nu_2$ are the 
degrees of freedom on which the two variance estimates are based, $\nu_1$ 
being the degrees of freedom corresponding to the larger of the two 
variance estimates. There are, therefore, a number of F distributions, one 
for each combination of the values of $\nu_1$ and $\nu_2$. Generally the distribution 
of F—like that of t—is unimodal, but is not symmetrical (except when 
$\nu_1 = \nu_2$), its main features being as shown in figure 2. However, the 
right-hand tail of the sampling distribution of the Fs defined by 
$\hat\sigma_1^2/\hat\sigma_2^2$ ($\hat\sigma_1^2$ and $\hat\sigma_2^2$ being variance estimates obtained from sample 
sets 1 and 2) is the same as the left-hand tail of the sampling distribution 
of the Fs defined by $\hat\sigma_2^2/\hat\sigma_1^2$. [...] need therefore be considered. 
Statistical [...] of F that will be exceeded in 5 per cent of pairs of random 
samples of various sizes, the F ratio [...] the larger of the two estimates in 
the [...] being necessarily greater than 1. The [...] estimate ($\nu_1$) fix the 
column of the table, and those for the smaller estimates ($\nu_2$) the row. 

[...] (table 2.3), therefore, we have, placing the (larger) between-groups 
mean square in the numerator, 

\[ F = \frac{49.00}{18.86} = 2.60. \]

This is less than the rea[...] chance one, and there would be no point in 
evaluating a F [...]. The only alternative to the null hypothesis of no 
difference between the population means (which implies that the F ratio 
estimates unity) is that of some difference between the population means 
(which implies that F defined as the ratio of the mean square for between 
groups to that for within groups estimates a quantity greater than unity). 
We sho[...], too, that the one-tailed nature of this test of significance now 
[...] statistical tables 2A and B relevant as they stand. 

Figure 2 The shape of the F distribution when [...] of 
freedom for the larger variance estimate are more t[...] 
Horizontal axis: values of F. 

There is an underlying identity between the two tests [...] previously 
obtained t = 1.61. The square of this [...], apart from rounding errors. The 
two tests used are essentially the same. The t test, however, is limited to 
two groups, whereas an F ratio can be derived from an analysis of variance 
from any number of groups. Illustrations are provided in the following 
chapters. 

* The beta function is related to the gamma function (see the t distribution, 
formula 5) by 
$B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) = \frac{\Gamma\left(\frac{\nu_1}{2}\right)\Gamma\left(\frac{\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}$. 
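
The identity between the two tests can be verified numerically from the figures of table 2.3. The sketch below uses only quantities quoted in the text (the sums of squares, degrees of freedom, group means and the tabled 5-per-cent value of F for 1 and 14 degrees of freedom); the variable names are illustrative.

```python
import math

ss_between, df_between = 49.00, 1     # table 2.3
ss_within, df_within = 264.00, 14     # table 2.3
n1 = n2 = 8                           # eight scores in each group
mean_difference = 56.00 - 52.50       # arts mean minus science mean

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

# t computed from the unrounded within-groups mean square (formula 3).
t_ratio = mean_difference / math.sqrt(ms_within * (1 / n1 + 1 / n2))

F_05 = 4.60   # statistical table 2A: column 1, row 14

print(round(f_ratio, 2), round(t_ratio ** 2, 2), f_ratio > F_05)
```

When no intermediate rounding is done, the square of t and the F ratio agree exactly for two groups, so the two tests must lead to the same conclusion.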


25 The Computation of F 


The computations for table 2.3 may be performed by [...] from the data 
(a) the total sum of squares and (b) the between-groups sum of squares. The 
within-groups sum of squares would then [...] by subtraction (total sum 
minus between-groups sum). [...] between-groups sum could be obtained 
directly from the group totals, i.e. without calculating the group means. 
Also the total sum of squares, being the corrected sum of squares of all the 
scores, would be best obtained by the method described in section 2.2. The 
steps are as follows: 


1. Sum of all scores, Σx = 63 + 61 + ... + 47 (see table 1.1) 
                         = 868 

2. Correction term, (Σx)²/n = 868²/16 
                            = 47,089.00 

3. Total sum of squares = Σx² - (Σx)²/n 
                        = 63² + 61² + ... + 47² - 47,089.00 
                        = 47,402.00 - 47,089.00 
                        = 313.00 

4. Between-groups sum of squares = Σ[(Group sum)²/Group size] - (Σx)²/n 
                                 = 448²/8 + 420²/8 - 47,089.00 
                                 = 47,138.00 - 47,089.00 
                                 = 49.00 

5. Within-groups sum of squares = 313.00 - 49.00 
                                = 264.00 


[...] that particular group before addi[...] 
The procedure is identical with [...] 
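
The five steps can be followed line by line in a short script. This is a sketch under one stated assumption: the science scores are not listed in this section and are reconstructed here from the science mean of 52.50 and the deviations in table 2.1. The arts scores are those set out in section 2.2.

```python
arts = [63, 61, 58, 57, 55, 52, 52, 50]
# Assumption: reconstructed from the mean 52.50 and the deviations in table 2.1.
science = [58, 56, 56, 54, 51, 50, 48, 47]
groups = (arts, science)

all_scores = arts + science
n = len(all_scores)

sum_x = sum(all_scores)                                          # step 1
correction = sum_x ** 2 / n                                      # step 2
total_ss = sum(x * x for x in all_scores) - correction           # step 3
between_ss = sum(sum(g) ** 2 / len(g) for g in groups) - correction   # step 4
within_ss = total_ss - between_ss                                # step 5

print(sum_x, correction, total_ss, between_ss, within_ss)
```

In step 4 each squared group sum is divided by the size of its own group, so the same lines would serve for groups of unequal size.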
2.6 The assumptions underlying the F test 


As with the t test, the assumptions of normality of distribution and 
homogeneity of variance underlie the F test of significance. In other words, 
evaluating the significance of differences from the readings of statistical 
table 2 assumes that all the groups are randomly selected from normally 
distributed populations and that the variances of these populations are 
equal. A number of empirical studies have been made of the effects of 
violations of these assumptions (e.g. Goddard and Lindquist 1940, 
Cochran 1947, Norton 1952). Norton’s study, which is reported by 
Lindquist (1956), is a very thorough one and worthy of special note. 
Norton sampled from six different populations of cards, three sym- 
metrical (one normal, one leptokurtic and one platykurtic) and three 
skewed (one slightly skewed, one markedly skewed and one J-shaped). 
From each of these, groups of cards were randomly selected, each group 
having the same number of cards. An F ratio from the sums of squares for 
between groups and within groups was then obtained. Repeated selections 
of the same number of groups gave an empirical distribution of F ratios. 
For each of the populations, and for different numbers of groups and 
group sizes, the percentages of Fs exceeding the theoretical 5-per-cent and 
1-per-cent limits were found. The results are shown in table 2.4A. 
Evidently the correspondence with the theoretical distribution is in all 
cases a close one, with ‘flatness’ or ‘peakness’ in the form of the distribu- 
tion more disturbing than lack of symmetry. The F test appears to be 
extremely insensitive to lack of normality—and especially lack of sym- 
metry—in the population, given that the same form of distribution occurs 
in all the populations sampled. 
Norton also investigated the effect of differing population variances, 
normal population variances of approximately 25, 100 and 225 being 
selected. In the results shown in table 2.4B, one group from each of these 
three populations was selected each time. The discrepancies between the 
empirical and theoretical distributions of F are still fairly small. Sub- 
stantially the same results were obtained when the forms of the distribu- 
tions (but not the variances) differed. It was only when samples were 
selected from population distributions differing both in form and variance 
that the discrepancies became more pronounced. For the results recorded 
in table 2.4C the differences in variance were extreme—one population 
variance being more than forty times another!—as well as the differences 
in form. Such a combination would be expected very infrequently in 
practice. If we have reason to believe it does occur, statistical table 2 would 
not be rendered valueless. Thus, for an obtained F exceeding the reading 
in the 5-per-cent table, significance at only 10 per cent could be claimed. 


Table 2.4 Percentage of Fs exceeding the theoretical 5-per-cent and 
1-per-cent limits in Norton’s study* 

A. Sampling from distributions with the same shape and equal 
variances 

                       Number       Group      Obtained percentage at 
Distribution           of groups    size       5 per cent    1 per cent 

Leptokurtic                3          3           7.83          2.76 
Leptokurtic                4          5           6.56          1.63 
Platykurtic                3          3           6.07          1.77 
Moderately skewed          4          5           5.15          1.32 
Markedly skewed            3          3           4.77          0.80 
Markedly skewed            4          5           4.76          1.00 
J-shaped                   3          3           4.80          1.00 

B. Sampling from distributions with the same shape and 
unequal variances 

                       Number       Group      Obtained percentage at 
Distribution           of groups    size       5 per cent    1 per cent 

Normal                     3          3           7.26          2.13 
Normal                     3         10           6.56          2.00 

C. Sampling from distributions with different shapes and 
unequal variances 

                       Number       Group      Obtained percentage at 
Distribution           of groups    size       5 per cent    1 per cent 

Normal,                    4          3          10.02          3.57 
moderately skewed,         4         10           8.10          2.93 
markedly skewed 
and J-shaped 

* [...] by permission of the publishers. 


We may reasonably conclude that the F test, like the t test, is remark- 
ably robust. It is insensitive both to lack of normality in the populations 
and to differing population variances (unless these differences are extreme 
and are combined with marked differences in form). Because of this 
robustness it is, in fact, unusual for any check to be made on the normality 
of distribution unless the departure from normality in the groups sampled 
is seen to be extreme. Again, it is often unnecessary to test for homogeneity 
of variance. If, however, an inspection of the scores suggests a lack of 
homogeneity—a pronounced difference in the group ranges, for instance— 
a test devised by Bartlett (1937) may be applied. A description of this is 
postponed until later (section 3.6). 
Chapter 3 Designs with Randomized Groups 


3.1 A simple methods experiment 


We begin by considering a basic design, that of randomized groups as 
exemplified by a simple methods experiment. Although this design in itself 
has very limited application in educational research, it is of fundamental 
importance in that it provides the basis for designs which are of direct 
usefulness. It is vital, therefore, that a thorough understanding of the 
randomized-groups design be achieved in the first place. 

Suppose that we are interested in the relative effectiveness of four 
methods of teaching a certain topic in a particular school. We select four 
random groups of pupils, and allocate each group to one of the methods. 
The pupils, we will suppose, are all from one age group, the age at which 
we wish to investigate the effectiveness of the methods. Otherwise the 
groups are selected solely by chance, e.g. by the use of tables of random 
numbers. Again, if the groups are taught by different teachers, the alloca- 
tion of the teachers to the groups must be solely by chance. Alternatively, 
all the groups could be taught by the same teacher. After the lesson, or 
series of lessons, the same test is administered to all four groups. We will 
suppose in this illustration that there are six pupils in each of the groups, 
and that the test scores obtained are as shown in table 3.1. 

A comparison of the mean scores for the four groups suggests differences
between the effectiveness of the methods. The F ratio, provided by an
analysis of variance, enables us to test the significance of the differences.
We proceed with the calculation as follows:

1. Overall sum of scores:

$$\sum x = 117 + 150 + 165 + 168 = 600$$

(The overall mean is therefore $600/24 = 25.0$.)

2. Correction term:

$$\frac{(\sum x)^2}{N} = \frac{600 \times 600}{24} = 15{,}000$$

3. Total sum of squares (the bracket is a sum of 24 terms):

$$(26^2 + 23^2 + 19^2 + \cdots + 22^2) - 15{,}000 = 15{,}718 - 15{,}000 = 718$$

4. Between-groups sum of squares:

$$\frac{117^2}{6} + \frac{150^2}{6} + \frac{165^2}{6} + \frac{168^2}{6} - 15{,}000 = 15{,}273 - 15{,}000 = 273$$

5. Within-groups sum of squares:

$$718 - 273 = 445$$

Table 3.1 Scores of four method groups, with six
pupils in each group

              Method groups
              1       2       3       4

              26      32      37      35
              23      28      30      29
              19      26      27      29
              17      24      26      28
              17      21      23      25
              15      19      22      22

Group sum     117     150     165     168
Group mean    19.5    25.0    27.5    28.0


The total sum of squares (718) gives a measure of the variation of the
scores not taking any account of their separation into groups. It is, in fact,
the sum of the squares of all the scores expressed as deviations from the
overall mean (25.0). The degrees of freedom are 23, 1 less than the total
number of scores.

The between-groups sum (273) expresses only the variation between
groups. It gives a measure of the variation that would result if each score
were replaced by the mean of the group to which it belongs. It is, in fact,
the sum of the squares of the four group means, expressed as deviations
from the overall mean, increased sixfold (since there are six scores in each
group). Thus, the sum of the squared deviations of the group means is

$$(19.5 - 25.0)^2 + (25.0 - 25.0)^2 + (27.5 - 25.0)^2 + (28.0 - 25.0)^2 = 5.5^2 + 0^2 + 2.5^2 + 3^2 = 45.50$$

and this, multiplied by 6, gives the between-groups sum of squares, 273.
The degrees of freedom are 3, 1 less than the number of groups.


Table 3.2 Analysis of variance of the data in table 3.1

                            Sum of     Degrees
Source of variation         squares    of freedom    Mean square

Between groups (methods)    273         3            91.00
Within groups               445        20            22.25
Total                       718        23


The within-groups sum (445), obtained above by subtraction, could
also have been obtained from first principles (see p. 36) as the sum of the
squares of all the scores expressed as deviations from their own group
mean. The degrees of freedom, also obtained by subtraction (23 - 3 = 20),
could also be derived from summing the 5 degrees of freedom for each of
the separate groups.

The three sums of squares together with their degrees of freedom are
set out in table 3.2. The resulting mean square, or variance estimate, for
between groups and within groups is also shown. The ratio of the mean
square for between groups to that for within groups is then obtained as

$$F = \frac{91.00}{22.25} = 4.09$$

This exceeds the reading in statistical table 2A for 3 and 20 degrees of
freedom (3.10). We conclude that the differences among the group means
are significant at the 5-per-cent level. It is improbable that chance alone
accounts for the obtained differences between the effectiveness of the
methods.
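
The five computational steps above can be retraced with a short script. This is only a sketch using the scores of table 3.1; the function name `one_way_anova` and its return order are illustrative choices, not part of the text.

```python
# Scores from table 3.1: four method groups, six pupils in each.
groups = [
    [26, 23, 19, 17, 17, 15],
    [32, 28, 26, 24, 21, 19],
    [37, 30, 27, 26, 23, 22],
    [35, 29, 29, 28, 25, 22],
]


def one_way_anova(groups):
    """Return (ss_total, ss_between, ss_within, df_between, df_within, F)."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    correction = sum(all_scores) ** 2 / n_total            # (sum x)^2 / N
    ss_total = sum(x * x for x in all_scores) - correction
    # Each squared group sum is divided by the size of its own group.
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between                       # by subtraction
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_total, ss_between, ss_within, df_between, df_within, f_ratio


ss_total, ss_between, ss_within, df_b, df_w, f_ratio = one_way_anova(groups)
print(ss_total, ss_between, ss_within)   # 718.0 273.0 445.0
print(df_b, df_w, round(f_ratio, 2))     # 3 20 4.09
```

The F ratio still has to be referred to a table of F for 3 and 20 degrees of freedom; the script does not look up the critical value.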

It is well to appreciate the limitations in our conclusion. This conclusion
(that the differences among the methods may reasonably be considered
non-chance) relates only to the pupils of the given year-group within the
particular school, and to the methods as taught by the teachers in that
school. Indeed, if all the methods have been taught by the same teacher,
the significance of the method differences relates only to the methods as
taught by that teacher. And if, on the other hand, the groups have been
taught by different teachers randomly selected from a panel of teachers in
the school, the significance of the differences relates to the methods as
taught by members of that panel. To generalize to pupils from other
schools (who may have grown used to different teaching methods,
methods that would affect their response to the particular methods
investigated) is not permissible. This is why a randomized-groups design,
such as that employed in a simple methods experiment, has few direct
applications in educational research.

A second limitation concerns the nature of the test of significance. It is
an overall test, one assessing the differences among the groups as a whole.
It does not follow that the difference between any particular two group
means is significant at the same level. Differences between particular
groups must be considered separately.

3.2 Comparisons of particular groups


Once the F ratio has indicated real differences among the group means, the
significance of the differences between any two group means could be
tested by the t ratio. We need not, however, proceed again from the
beginning. The within-groups mean square can usually be taken as an
estimate (and one based on all the available data) of the population
variance for each of the groups. (This implies that the assumption of
homogeneity of variance extends over all the groups, not just the particular
two groups we are considering.) The standard error of the difference
between the group means is then estimated by

$$\hat{\sigma}_{(M_1 - M_2)} = \sqrt{\frac{2\hat{\sigma}^2}{n}} \qquad (10)$$

where $\hat{\sigma}^2$ is the within-groups mean square, and $n$ is the size
of the group. The formula corresponds to formula (3), p. 27, where both
$n_1$ and $n_2$ are replaced by $n$. In this illustration, then, the standard
error is evaluated as

$$\sqrt{\frac{2 \times 22.25}{6}} = 2.72$$

With $d$ referring to an obtained difference in means, we would then have
$t = d/2.72$, and this would be compared with the t ratio for 20 degrees of
freedom (the same number as that on which the estimate $\hat{\sigma}^2$ is
based) in statistical table 1.
Alternatively, starting from statistical table 1 we see that, for 20 degrees
of freedom, a t ratio of 2.09 is necessary for significance at the 5-per-cent
level. Putting $t = 2.09$, we then have $2.09 = d/2.72$, whence
$d = 2.09 \times 2.72 = 5.68$, so giving the minimum difference between
group means for significance at the 5-per-cent level. Of the six differences
between the four group means shown in table 3.1, only two exceed this:
that between the means of groups 1 and 3 (8.0) and that between the means
of groups 1 and 4 (8.5). Only these differences, therefore, are significant at
the 5-per-cent level.
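
As a rough check on formula (10), the standard error and the t ratio for each of the six pairs of groups can be computed directly. The sketch below takes the group means and the within-groups mean square from tables 3.1 and 3.2; the variable names are illustrative, and the critical value of t is deliberately left to be read from statistical table 1.

```python
from itertools import combinations
from math import sqrt

group_means = {1: 19.5, 2: 25.0, 3: 27.5, 4: 28.0}   # table 3.1
ms_within = 22.25                                    # table 3.2, 20 df
n = 6                                                # scores per group

# Formula (10): standard error of the difference between two group means.
se_diff = sqrt(2 * ms_within / n)
print(round(se_diff, 2))   # 2.72

# t ratio for every pair of groups, largest first.
t_ratios = {
    (a, b): abs(group_means[a] - group_means[b]) / se_diff
    for a, b in combinations(group_means, 2)
}
for pair, t in sorted(t_ratios.items(), key=lambda item: -item[1]):
    print(pair, round(t, 2))
```

Each ratio is then compared with the tabled t for 20 degrees of freedom, subject to the cautions about multiple comparisons discussed in this section.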


[...] selecting the largest difference and testing its significance by the
t ratio. The largest difference (and also, for that matter, other differences)
might well be significant at, say, the 5-per-cent level without the overall
differences being significant at this level.* We could justifiably test a
particular difference in group means (irrespective of the significance of the
overall differences) [...]

When the number of groups is large, testing the significance of particular
differences by the t test in the way described is open to criticism, even
when the significance of the overall differences (at the same level) has first
been established. This is because, just as when the evidence indicates no
real differences (the overall null hypothesis being accepted), some of the
separate differences must none the less be expected to exceed the minimum
difference for significance (5 per cent of the differences exceeding the
minimum difference for 5-per-cent significance), so when real differences
are indicated, slightly more than the allowed percentage of the separate
differences must be expected to exceed the minimum difference as
calculated above. In other words, a slightly more stringent test for the
significance of the separate differences would be desirable. Readers who may
have to deal with a large number of groups in this way should consult the
tests provided by Tukey (1953), a modification of which is described by
Snedecor (1956, pp. 251-3), and Scheffé (1953). Discussions by Federer
(1955) and Ryan (1959) are also of interest.

* Statistical table 1 gives the probability of obtaining various values of t
from single random sample pairs, not the probability of obtaining values of t
as the largest of a number of random sample pairs.

The qualification just discussed does not, of course, apply to special
comparisons suggested by the nature of the methods themselves; nor does
it apply if a comparison between a particular method and a combination
of other methods is suggested. Suppose, for example, that all four methods
of the present illustration relate to the effect of practice and coaching on
improving test performance, and that method 1 is unique in that it involves
no special preparation of any kind. (The method 1 group, in other words,
serves as a control.) We will suppose that the methods are as follows:

Method 1: no practice or coaching.

Method 2: practice only.

Method 3: practice plus slight coaching.

Method 4: practice plus intense coaching.

It would then be justifiable, irrespective of the value of the F ratio, to
compare method 1 with methods 2, 3 and 4 combined. This involves a
partitioning of the between-groups variation shown in table 3.2 into two
components, one based on the difference between method 1 and the other
methods combined, and the other on the differences of the other methods
among themselves. For the former component the sum of squares is

$$\frac{117^2}{6} + \frac{(150 + 165 + 168)^2}{18} - 15{,}000 = 15{,}242 - 15{,}000 = 242$$

This sum has 1 degree of freedom, being based on the difference between
two means, that for group 1 and that for groups 2, 3 and 4 combined. For
the differences among methods 2, 3 and 4 the sum of squares is

$$\frac{150^2}{6} + \frac{165^2}{6} + \frac{168^2}{6} - \frac{(150 + 165 + 168)^2}{18} = 12{,}991.5 - 12{,}960.5 = 31.0$$

More shortly, this could have been obtained by subtraction, 273 - 242.
The degrees of freedom are 2, as the sum is based on the differences among
three groups.


Table 3.3 Further analysis of the data in table 3.1

                                          Sum of     Degrees
Source of variation                       squares    of freedom    Mean square

Between group 1 and groups 2, 3 and 4     242         1            242.00
Among groups 2, 3 and 4                    31         2             15.50
Within groups                             445        20             22.25
Total                                     718        23


The full analysis is set out in table 3.3. We see that the greater part of
the former between-groups variation (table 3.2) now appears between
group 1 and the rest, the ratio of the mean square to that for within groups
being

$$F = \frac{242.00}{22.25} = 10.88$$

This exceeds the reading in statistical table 2B for 1 and 20 degrees of
freedom (8.10). The difference is significant at the 1-per-cent level. We may
reasonably conclude that the methods involving practice have a real
superiority over method 1.

The variation among methods 2, 3 and 4, on the other hand, is small
and statistically insignificant ($F = 15.50/22.25 < 1$). The results fail to
establish a real difference (overall) among the methods involving practice.
Despite this, however, we could well argue for a test of significance
contrasting method 2 (the one method of the three not involving coaching)
with methods 3 and 4 combined. Such a test would involve a further
partitioning of the between-groups variation, one in which the sum of
squares among groups 2, 3 and 4, 31.00, is split into two components. One
component would be based on the difference between method 2 and
methods 3 and 4 combined, and the other on the difference between
methods 3 and 4. For the former component the sum of squares is

$$\frac{150^2}{6} + \frac{(165 + 168)^2}{12} - \frac{(150 + 165 + 168)^2}{18} = 12{,}990.75 - 12{,}960.50 = 30.25$$

This sum has 1 degree of freedom, being based on the difference between
two means, that for group 2 and that for groups 3 and 4 combined. For
the latter component the sum of squares is

$$\frac{165^2}{6} + \frac{168^2}{6} - \frac{(165 + 168)^2}{12} = 9{,}241.50 - 9{,}240.75 = 0.75$$

Again, this could have been obtained more simply by subtraction, i.e.
31.00 - 30.25. This sum also has 1 degree of freedom. The complete
analysis is set out in table 3.4.


Table 3.4 Still further analysis of the data in table 3.1

                                          Sum of     Degrees
Source of variation                       squares    of freedom    Mean square

Between group 1 and groups 2, 3 and 4     242.00      1            242.00
Between group 2 and groups 3 and 4         30.25      1             30.25
Between groups 3 and 4                      0.75      1              0.75
Within groups                             445.00     20             22.25
Total                                     718.00     23

Almost all the variation among groups 2, 3 and 4 now appears between
group 2 and groups 3 and 4. Even so, the ratio of the mean square to that
for within groups is only $F = 30.25/22.25 = 1.36$, as compared with the
ratio of 4.35 necessary for significance at the 5-per-cent level (see statistical
table 2A). Clearly the effect of coaching additional to that of practice has
not been established. And of even less statistical significance is the effect
of intensive coaching, since the corresponding ratio for the difference
between groups 3 and 4 is $F = 0.75/22.25 < 1$.
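
The successive partitioning of the between-groups sum of squares can be retraced numerically. The sketch below is illustrative only: the helper `ss_between_sets` is a name introduced here, and it simply applies the "squared sum divided by number of scores, less correction term" rule used in the text to each pair of sets.

```python
def ss_between_sets(sets):
    """Sum of squares among sets, each given as (sum of scores, number of scores)."""
    total = sum(s for s, _ in sets)
    count = sum(m for _, m in sets)
    return sum(s * s / m for s, m in sets) - total * total / count


sums = [117, 150, 165, 168]   # group sums, table 3.1
n = 6                         # scores per group
ms_within = 22.25             # within-groups mean square, 20 df

# Practice: group 1 against groups 2, 3 and 4 combined.
ss_1_vs_234 = ss_between_sets([(sums[0], n), (sum(sums[1:]), 3 * n)])
# Coaching: group 2 against groups 3 and 4 combined.
ss_2_vs_34 = ss_between_sets([(sums[1], n), (sums[2] + sums[3], 2 * n)])
# Intensity of coaching: group 3 against group 4.
ss_3_vs_4 = ss_between_sets([(sums[2], n), (sums[3], n)])

print(ss_1_vs_234, ss_2_vs_34, ss_3_vs_4)       # 242.0 30.25 0.75
print(ss_1_vs_234 + ss_2_vs_34 + ss_3_vs_4)     # 273.0

# Each component has 1 degree of freedom, so its mean square equals its sum.
for ss in (ss_1_vs_234, ss_2_vs_34, ss_3_vs_4):
    print(round(ss / ms_within, 2))
```

The three components add up to the between-groups sum of squares of table 3.2, which is the point of the partition.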
The complete breakdown of the variation is shown in figure 3. We
should emphasize that this breakdown [...]

[Figure 3 Breakdown of variation in a methods experiment. The total
variation divides into between groups 1, 2, 3 and 4 (effect tested: overall
method differences) and within groups. The between-groups variation
divides into group 1 v groups 2, 3 and 4 (practice) and between groups
2, 3 and 4. The latter divides into group 2 v groups 3 and 4 (coaching) and
group 3 v group 4 (intensity of coaching). Group 1 receives neither practice
nor coaching, group 2 receives practice, group 3 receives practice plus
slight coaching, and group 4 receives practice plus intense coaching.]

3.3 The general model


We must now make an explicit statement of what is involved in the analysis
of variance used in the methods experiment described. A basic assumption
is that any pupil's score can be regarded as made of three parts, namely:

1. A part common to all the scores.
2. A part characteristic of the particular method (and therefore
common to the scores of all pupils in the particular method group).
3. A part characteristic of the particular pupil.

It is also assumed that the parts are independent and additive. We can
therefore express the assumption in algebraic shorthand by saying that
the score $x_{ij}$ of the jth pupil in the ith method consists of independent
components as follows:

$$x_{ij} = M + A_i + e_{ij} \qquad (11)$$

where $M$ is a component common to all the scores;
$A_i$ is a component common to all scores in method $i$;
and $e_{ij}$ is a component specific to pupil $j$ of method $i$.
(In the particular experiment described $i$ would run from 1 to 4, there being
four groups in all, and $j$ would run from 1 to 6, there being six scores in
each of the groups.)

Without loss of generality we may put $A_i$ equal to the population
mean of the scores for method $i$, i.e. the mean of the hypothetical
population of scores that would result if all the original population of
pupils (the population from which all the groups were originally selected)
had undergone the 'treatment' of method $i$, and express this as a deviation
from the mean of the population means for all the methods.* It then
follows that, summing for all the methods, $\sum A_i = 0$. Equally without
loss of generality we may put $M$ equal to the mean of the population
means for all the methods.

To revert to table 3.1, therefore, the overall mean 25.0 would be an
estimate of $M$, while the separate group means, expressed as deviations
from 25.0, would be estimates of the $A$s, i.e. 19.5 - 25.0 = -5.5 would
be an estimate of $A_1$, 25.0 - 25.0 = 0 would be an estimate of $A_2$,
27.5 - 25.0 = 2.5 would be an estimate of $A_3$, and 28.0 - 25.0 = 3.0
would be an estimate of $A_4$.

* It is essential to distinguish between the hypothetical populations of
scores, one for each of the method groups, and the original or parent
population of pupils from which all the groups were originally selected. The
F test may then be regarded as a test of the tenability of the null hypothesis
that all the treatment populations have the same mean and the same variance,
this common variance being estimated by the within-groups mean square.
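
The estimates of the three parts in equation (11) can be obtained from table 3.1 as follows. This is a minimal sketch; the names `m_hat`, `a_hat` and `e_hat` are illustrative labels for the estimates of $M$, $A_i$ and $e_{ij}$.

```python
# Scores from table 3.1, one list per method group.
groups = [
    [26, 23, 19, 17, 17, 15],
    [32, 28, 26, 24, 21, 19],
    [37, 30, 27, 26, 23, 22],
    [35, 29, 29, 28, 25, 22],
]

scores = [x for g in groups for x in g]
m_hat = sum(scores) / len(scores)                    # estimate of M
group_means = [sum(g) / len(g) for g in groups]
a_hat = [gm - m_hat for gm in group_means]           # estimates of A_i
# Estimates of e_ij: each score as a deviation from its own group mean.
e_hat = [[x - gm for x in g] for g, gm in zip(groups, group_means)]

print(m_hat)        # 25.0
print(a_hat)        # [-5.5, 0.0, 2.5, 3.0]
print(sum(a_hat))   # 0.0
print(e_hat[0])     # [6.5, 3.5, -0.5, -2.5, -2.5, -4.5]

# Every score is recovered exactly as M + A_i + e_ij.
assert all(
    x == m_hat + a + e
    for g, a, row in zip(groups, a_hat, e_hat)
    for x, e in zip(g, row)
)

# The squared e_ij estimates add up to the within-groups sum of squares.
print(sum(e * e for row in e_hat for e in row))   # 445.0
```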

The part $e_{ij}$ represents the element of randomness essential to every
design. Since $M$ [...] for method $i$, it [...] $i$ in turn [...] as a [...]
with respect to [...] 26 - 19.5 = 6.5 estimates $e_{11}$, [...]
19 - 19.5 = -0.5 estimates $e_{13}$, and so on. Note that the six estimates
of $e_{1j}$ [...] scores from any other group [...] provided by the
within-[...] $\sigma^2$, it follows that these variances must differ only by
chance, [...] groups mean square estimates $\sigma^2$. It [...]
between-groups mean square estimates $\sigma^2$ plus a component due to
the variation of the population means $A_i$. And with $k$ groups in all,
and $n$ scores in each group, this added component is
$n \sum A_i^2 / (k - 1)$. In other words

$$F \text{ estimates } \frac{\sigma^2 + n \sum A_i^2 / (k - 1)}{\sigma^2}$$

The null hypothesis of no differences among the population means is that
$\sum A_i^2 = 0$ (since the $A_i$s are in deviation form, no differences
imply that each $A_i$ is zero), so that the added component is also zero. If
the hypothesis is true, therefore, F estimates 1. Conversely, if the
hypothesis is false, F estimates a quantity greater than 1. In practice, of
course, we do not know what F estimates. We only know F. Statistical
table 2, however, provides the probability of values of F greater than 1
arising from sampling when the null hypothesis is true. If it so happens
that the obtained F is less than 1, the between-groups mean square being
less than that for within groups, the difference from 1 is necessarily due to
chance and the null hypothesis is accepted.

The quantity $\sum A_i^2 / (k - 1)$ has the familiar form of a variance
estimate. It would, in fact, be the estimated variance of the population (if
such a population could exist) from which $A_1, A_2, \ldots, A_k$ is itself a
sample. It would not, in fact, be very meaningful to regard the $A_i$s of a
methods experiment as a random sample of a population (and theoretically
an infinitely large population) in this way. The particular methods
compared would not, of course, be the only possible methods, but we could
expect them to be (in the opinion of the experimenter, at any rate) the best
of a relatively small number. They would not be regarded as a random
sample from a large number of methods. And if the experiment were to be
repeated with different groups of pupils, precisely the same methods would
be used again (and so the $A_i$s would be unchanged). We describe this by
saying that we have a fixed-effects model. The methods are not subject to
sampling.

In contrast, however, we could conduct experiments, with the same
randomized-groups design, in which the basis of the group classification
(that which corresponds to the methods of a methods experiment) is
subject to sampling. We could, for instance, test one group of pupils from
each of a number of different, and randomly selected, schools. It would
then be appropriate to regard the $A_i$s (now the population means from
the particular schools) as a random sample of a population of similar
means. And if the experiment were to be repeated, not only different pupils
but also different schools would be used. The model would now be said to
have random effects. For such a model it would be sensible to replace
$\sum A_i^2 / (k - 1)$ by $\sigma_A^2$, $\sigma^2$ being the accepted symbol
for a population variance and the subscript $A$ showing that this variance
derives from the variation not of individuals but of entire groups.

There is no reason why this replacement should not also be made for a
fixed-effects model, provided we appreciate that then the replacement is a
formal one only. Some writers insist on retaining the form
$\sum A_i^2 / (k - 1)$ for fixed-effects models, while others prefer different
symbols, such as $\kappa^2$ or $\theta^2$. To the present writer this seems
unnecessarily pedantic. The advantages of using the form $\sigma_A^2$
throughout, whether the basis of group classification is subject to sampling
or not, will be apparent when more complicated designs are described.*


Table 3.5 Components analysis for a
randomized-groups design

Source of        Degrees of     Mean-square
variation        freedom†       expectation

Groups (A)       k - 1          $\sigma^2 + n\sigma_A^2$
Individuals      k(n - 1)       $\sigma^2$
Total            kn - 1

† It is assumed that there are k groups, each of size n.

We therefore write the parameters estimated by the mean square of the
two independent sources of variation in a randomized-groups design
(individuals and groups) as shown in the last column of table 3.5. A
component analysis such as this will be found to be indispensable for
testing the significance of different sources of variation in the more
complex designs described later in the book.
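
One use of the mean-square expectations in table 3.5, not carried out in the text, is to equate each observed mean square to its expectation and solve for the two components. The sketch below does this for the figures of table 3.2; the function name is hypothetical, and for a fixed-effects model such as the methods experiment the resulting value of $\sigma_A^2$ is a formal quantity only.

```python
def variance_components(ms_groups, ms_individuals, n):
    """Solve ms_groups = sigma2 + n * sigma2_a and ms_individuals = sigma2."""
    sigma2 = ms_individuals
    sigma2_a = (ms_groups - ms_individuals) / n
    return sigma2, sigma2_a


# Mean squares from table 3.2, with n = 6 scores in each group.
sigma2, sigma2_a = variance_components(91.00, 22.25, 6)
print(sigma2, round(sigma2_a, 2))   # 22.25 11.46
```

When the between-groups mean square is smaller than the within-groups mean square this subtraction gives a negative value, which is usually read as an estimate of zero.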


3.4 Groups of unequal size 


* [...] (see Wilk and Kempthorne [...]

Although it is more convenient to use groups of the same size whenever
possible, the method of analysis could equally well be employed with
groups of different sizes.

As far as the computation of the sums of squares is concerned, all that
need be especially noted is that in calculating the between-groups sum the
square of each group sum must be divided by its own group size ($n_i$).
Thus, for the data recorded in table 3.6 the between-groups sum of
squares would be

$$\frac{180^2}{6} + \frac{183^2}{5} + \cdots + \frac{280^2}{9} - \frac{(180 + 183 + \cdots + 280)^2}{49} = 510.20$$

the first part of the expression being a sum of 7 terms.


Table 3.6 Scores of seven groups of unequal size

                    Groups
                    1       2       3       4       5       6       7

                    ...     ...     ...     ...     ...     ...     ...
                    25              28      26      22      23      28
                                    26      24              21      28
                                    25                              27
                                    24                              26

Group sum           180     183     266     205     148     181     280
Group size (n_i)    6       5       9       7       6       7       9
Group mean          30.00   36.60   29.56   29.29   24.67   25.86   31.11


The total and within-groups sums would be obtained in precisely the
same way as before (p. 42). The reader may wish to verify that the analysis
of variance would be as shown in table 3.7. The degrees of freedom for
within groups, obtained as before by subtraction, would also result from
summing $n_i - 1$ (i.e. group size minus 1) for all the groups, and is
therefore equal to $N - k$, where $N$ is the total number of scores and $k$
the number of groups. For this example, therefore, the degrees of freedom
for within groups would result from

$$\sum (n_i - 1) = 5 + 4 + \cdots + 8 = 42$$

The F ratio is $85.03/13.00 = 6.54$, as compared with the reading of 3.27
from statistical table 2B for 6 and 42 degrees of freedom. The differences
among the group means are significant at the 1-per-cent level.

Table 3.7 Analysis of variance of the data in table 3.6

Source of          Sum of        Degrees
variation          squares       of freedom    Mean square

Between groups       510.20       6            85.03
Within groups        545.93      42            13.00
Total              1,056.13      48
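
The between-groups computation for unequal groups needs only the group sums and group sizes, so it can be checked from the summary rows of table 3.6. In this sketch the within-groups sum of squares (545.93) is taken from table 3.7 rather than recomputed, and the variable names are illustrative.

```python
group_sums = [180, 183, 266, 205, 148, 181, 280]   # table 3.6
group_sizes = [6, 5, 9, 7, 6, 7, 9]

n_total = sum(group_sizes)                          # N = 49
k = len(group_sizes)                                # k = 7
correction = sum(group_sums) ** 2 / n_total

# Each squared group sum is divided by its own group size.
ss_between = sum(s * s / n for s, n in zip(group_sums, group_sizes)) - correction

ss_within = 545.93                                  # from table 3.7
df_between = k - 1
df_within = sum(n - 1 for n in group_sizes)         # equals N - k

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 2), df_between, df_within, round(f_ratio, 2))
```

The between-groups sum computed this way agrees with the 510.20 of table 3.7 to within rounding in the second decimal place.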


The general model would be written as

$$x_{ij} = M + A_i + e_{ij}$$

the symbols having the same meaning as before (equation 11, p. 55) except
that $j$ now runs from 1 to $n_i$, with $i$ running from 1 to $k$ as before.

In the components analysis, the term $\sigma_A^2$ must now be multiplied
by an average of the $n$s in the mean-square expectation for groups (see
table 3.5). The correct average ($n_{av}$) is given by

$$n_{av} = \frac{1}{k - 1}\left(\sum n_i - \frac{\sum n_i^2}{\sum n_i}\right) \qquad (12)$$

where $\sum n_i$ is the sum of the group sizes, and $\sum n_i^2$ the sum of
the squares of the group sizes.* For the data in table 3.6 we therefore have

$$\sum n_i = 6 + 5 + \cdots + 9 = 49$$

and

$$\sum n_i^2 = 36 + 25 + \cdots + 81 = 357$$

so that

$$n_{av} = \frac{1}{6}\left(49 - \frac{357}{49}\right) = 6.95$$

The components analysis for the general case of groups of unequal size is
shown in table 3.8.

* See Kempthorne (1952). If all the groups are of the same size, i.e.
$n_i = n$ for all $i$, $n_{av}$ reduces to $n$.

Table 3.8 Components analysis for a randomized-groups design with
groups of unequal size

Source of        Degrees of     Mean-square
variation        freedom*       expectation

Groups (A)       k - 1          $\sigma^2 + n_{av}\sigma_A^2$†
Individuals      N - k          $\sigma^2$
Total            N - 1

* There are N scores in all, distributed over k groups.

† $n_{av} = \frac{1}{k - 1}\left(N - \frac{\sum n_i^2}{N}\right)$, $n_i$
being the number of scores in the ith group.
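
Formula (12) is easily evaluated for any set of group sizes. The following sketch uses the sizes of table 3.6; the function name `average_group_size` is introduced here for illustration.

```python
def average_group_size(sizes):
    """Formula (12): n_av = (sum n_i - sum n_i^2 / sum n_i) / (k - 1)."""
    k = len(sizes)
    total = sum(sizes)
    return (total - sum(n * n for n in sizes) / total) / (k - 1)


n_av = average_group_size([6, 5, 9, 7, 6, 7, 9])   # group sizes, table 3.6
print(round(n_av, 2))                              # 6.95

# With equal groups the average reduces to the common group size.
print(average_group_size([6, 6, 6, 6]))            # 6.0
```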
= m 


3.5 Orthogonal comparisons 


Earlier in this chapter some designed comparisons between the groups of
a methods experiment were described. We saw that the between-groups
sum of squares of an analysis of variance which had 3 degrees of freedom
(table 3.2) was replaced by three components, each with 1 degree of freedom
(table 3.4). Each of these components was based on a comparison between
two or more of the method groups. Whenever we have a number of
different method or 'treatment' groups, it is possible to analyse the
between-groups sum of squares into separate components, each with a
single degree of freedom, in this way. Indeed, with three or more groups
such an analysis may be made in several ways, though whether more than
one way would be of interest in any particular experiment is another
matter.

If the sums of the scores for the $k$ groups are denoted by
$S_1, S_2, \ldots, S_k$, a comparison may then be written as

$$c = \lambda_1 S_1 + \lambda_2 S_2 + \cdots + \lambda_k S_k$$

Comparing $S_1$ with the mean of $S_2$, $S_3$ and $S_4$ is the same as
comparing $3S_1$ with $S_2 + S_3 + S_4$. We therefore put $\lambda_1 = 3$,
and each of $\lambda_2$, $\lambda_3$ and $\lambda_4$ equal to -1. Similarly,
to compare group 2 with groups 3 and 4, we put $\lambda_1 = 0$ (since
group 1 does not enter into the comparison), $\lambda_2 = 2$, and
$\lambda_3 = \lambda_4 = -1$. Note that there would be no point in
comparing $S_1$ with $S_2 + S_3 + S_4$, or $S_2$ with $S_3 + S_4$. Our
interest would always be confined to comparisons in which the sum of the
coefficients is zero.

Our reason for thinking of a comparison in terms of $\lambda$ coefficients
in this way is that the sum of squares may be written down straight away.
The sum of squares for any comparison is

$$\frac{c^2}{n \sum \lambda^2}$$

Thus, for the group 1 v groups 2, 3 and 4 comparison in the methods
experiment (see table 3.1) the sum of squares would be

$$\frac{(3 \times 117 - 150 - 165 - 168)^2}{6 \times 12} = 242, \text{ as shown in table 3.3.}$$

Again, for the group 2 v groups 3 and 4 comparison the sum is

$$\frac{(2 \times 150 - 165 - 168)^2}{6 \times 6} = 30.25, \text{ as shown in table 3.4.}$$

Contrast this with the previous method of calculation (pp. 51 and 53).
In each denominator 6 is the number of scores in each group. The second
factor, $\sum \lambda^2$, is obtained as $3^2 + (-1)^2 + (-1)^2 + (-1)^2$ in
the first comparison, and as $2^2 + (-1)^2 + (-1)^2$ in the second.
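
The coefficient form of the sum of squares lends itself to a one-line function. The sketch below applies it to the group sums of table 3.1; the name `comparison_ss` is an illustrative choice, and the function rejects any set of coefficients that does not sum to zero.

```python
def comparison_ss(coeffs, sums, n):
    """Sum of squares c^2 / (n * sum of squared coefficients) for equal groups of size n."""
    if sum(coeffs) != 0:
        raise ValueError("the coefficients of a comparison must sum to zero")
    c = sum(l * s for l, s in zip(coeffs, sums))
    return c * c / (n * sum(l * l for l in coeffs))


sums = [117, 150, 165, 168]   # group sums, table 3.1

print(comparison_ss([3, -1, -1, -1], sums, 6))   # 242.0
print(comparison_ss([0, 2, -1, -1], sums, 6))    # 30.25
```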

Two comparisons are said to be orthogonal, or independent, if the sum
of the products of the corresponding coefficients is zero, i.e. if the
comparisons are

$$c_1 = \lambda_{11} S_1 + \lambda_{12} S_2 + \cdots + \lambda_{1k} S_k$$

and

$$c_2 = \lambda_{21} S_1 + \lambda_{22} S_2 + \cdots + \lambda_{2k} S_k$$

then they are orthogonal if

$$\lambda_{11}\lambda_{21} + \lambda_{12}\lambda_{22} + \cdots + \lambda_{1k}\lambda_{2k} = 0 \qquad (13)$$

The two comparisons of the methods experiment referred to above are
orthogonal, since (3)(0) + (-1)(2) + (-1)(-1) + (-1)(-1) = 0. Indeed,
any two of the three designed comparisons described for this experiment
are orthogonal, as may be seen from the coefficients set out below.

              Groups
Comparison    1      2      3      4

c1            3     -1     -1     -1
c2            0      2     -1     -1
c3            0      0      1     -1

The importance of orthogonality in three or more comparisons is that
the sum of squares for a second orthogonal comparison is part of the
residual of the between-groups sum, i.e. it is a component of that part of
the between-groups sum left after the removal of the sum for the first
comparison; and the sum of squares for a third comparison, orthogonal
to both the other two, is a component of that part of the between-groups
sum left after the removal of the sums for the first and second comparisons;
and so on. In other words, if with $k$ groups in all, we select $k - 1$
comparisons mutually orthogonal, then

$$\frac{c_1^2}{n \sum \lambda_1^2} + \frac{c_2^2}{n \sum \lambda_2^2} + \cdots + \frac{c_{k-1}^2}{n \sum \lambda_{k-1}^2} = \text{between-groups sum of squares}$$

This has already been illustrated for $k = 4$ groups by the three
mutually orthogonal comparisons described for the methods experiment,
since

$$242.00 + 30.25 + 0.75 = 273.00$$

which is the between-groups (or methods) sum of squares in table 3.2.
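
Condition (13) and the additive property can both be verified mechanically for the three designed comparisons. This is a sketch only; `is_orthogonal` and `comparison_ss` are illustrative helper names.

```python
from itertools import combinations


def is_orthogonal(c1, c2):
    """Condition (13): the products of corresponding coefficients sum to zero."""
    return sum(a * b for a, b in zip(c1, c2)) == 0


def comparison_ss(coeffs, sums, n):
    """Sum of squares of one comparison for equal groups of size n."""
    c = sum(l * s for l, s in zip(coeffs, sums))
    return c * c / (n * sum(l * l for l in coeffs))


comparisons = [
    (3, -1, -1, -1),   # group 1 v groups 2, 3 and 4
    (0, 2, -1, -1),    # group 2 v groups 3 and 4
    (0, 0, 1, -1),     # group 3 v group 4
]
sums = [117, 150, 165, 168]   # group sums, table 3.1

all_orthogonal = all(is_orthogonal(a, b) for a, b in combinations(comparisons, 2))
total = sum(comparison_ss(c, sums, 6) for c in comparisons)
print(all_orthogonal, total)   # True 273.0
```

A pair such as (3, -1, -1, -1) and (1, -1, 0, 0) fails the check, and sums of squares from such non-orthogonal comparisons would not in general add up to the between-groups sum.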
It is possible to partition the between-groups sum of squares into more
than one set of mutually orthogonal comparisons. Thus, coefficients for
another set of orthogonal comparisons for four treatment groups are
shown below.

              Groups
Comparison    1      2      3      4

c1            1      1     -1     -1
c2            1     -1      1     -1
c3            1     -1     -1      1

As before, we see that the sum of the coefficients in each row is zero, and
that for any two of the rows the sum of the products of corresponding
coefficients is also zero.

The [...] be of interest. [...]

Group 1: phonetic approach, with traditional orthography.
Group 2: 'look-and-say' approach, with traditional orthography.
Group 3: phonetic approach, with augmented roman alphabet.
Group 4: 'look-and-say' approach, with augmented roman alphabet.

Comparison $c_1$ would then compare the suitability of the two types of
reading material; the second comparison $c_2$ would compare the
effectiveness of the two approaches; and the third comparison $c_3$ would
compare the differences in the effectiveness of the two approaches with
each type of material. This last comparison is necessary, since it would be
unwise to assume that any difference between the approaches would be
unaffected by the reading material.

In some experiments it might be unnecessary to make all the comparisons:
only some of the $k - 1$ comparisons might be of interest. It is always
more efficient, however, to incorporate whatever comparisons are of
interest in a set of mutually orthogonal comparisons whenever this is
possible.

Table 3.9 An illustration of Bartlett's test of homogeneity of variance.
Sums of squares derived from scores in table 3.6

         Sums of     Degrees of
         squares     freedom                    Mean square
Group    (within     (n_i - 1)    1/(n_i - 1)   (S_i^2)       log S_i^2    (n_i - 1) log S_i^2
         groups)

1         64.00       5           0.2000        12.800        1.1072        5.5360
2         23.20       4           0.2500         5.800        0.7634        3.0536
3        142.22       8           0.1250        17.778        1.2500       10.0000
4         69.43       6           0.1667        11.572        1.0633        6.3798
5         27.33       5           0.2000         5.466        0.7377        3.6885
6         78.86       6           0.1667        13.143        1.1186        6.7116
7        140.89       8           0.1250        17.611        1.2457        9.9656

k = 7    545.93      42           1.2334                                   45.3351

Mean square:

$$S^2 = \frac{\sum \text{(sums of squares)}}{\sum (n_i - 1)} = \frac{545.93}{42} = 13.00$$

$$\sum (n_i - 1) \log S^2 = 42 \times \log 13.00 = 42 \times 1.1139 = 46.7838$$

$$\chi^2 = 2.3026 \left[\sum (n_i - 1) \log S^2 - \sum (n_i - 1) \log S_i^2\right] = 2.3026\,[46.7838 - 45.3351] = 3.336$$

Correction factor:

$$C = 1 + \frac{1}{3(k - 1)}\left[\sum \frac{1}{n_i - 1} - \frac{1}{\sum (n_i - 1)}\right] = 1 + \frac{1}{18}\left[1.2334 - \frac{1}{42}\right] = 1.0672$$

$$\text{Corrected } \chi^2 = \frac{3.336}{1.0672} = 3.126; \quad \text{degrees of freedom, } k - 1 = 6$$

Modifications for unequal groups 


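If the $k$ groups do not contain equal numbers of scores, then the linear
function of the group sums

$$c = \lambda_1 S_1 + \lambda_2 S_2 + \cdots + \lambda_k S_k$$

is a comparison only if

$$n_1 \lambda_1 + n_2 \lambda_2 + \cdots + n_k \lambda_k = 0 \qquad (14)$$

The sum of squares for the comparison is then

$$\frac{c^2}{\sum_{i=1}^{k} n_i \lambda_i^2}$$

the denominator being the similarly weighted sum of the squares of the
coefficients.

Finally, the criterion for the orthogonality of two comparisons is that
the weighted sum of the products of corresponding coefficients be zero,
the weights being the sizes of the groups; i.e. with the same notation as
before, the comparisons $c_1$ and $c_2$ are orthogonal if

$$n_1 \lambda_{11} \lambda_{21} + n_2 \lambda_{12} \lambda_{22} + \cdots + n_k \lambda_{1k} \lambda_{2k} = 0$$

The sum of squares for each successive comparison, from a set of
mutually orthogonal comparisons, is a component of the between-groups
sum in exactly the same way as before.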
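
To see the weighted conditions at work, the sketch below applies them to the group sums and sizes of table 3.6. The two comparisons are hypothetical examples chosen for illustration (group 2 against all the other groups pooled, and group 1 against group 3); they are not comparisons discussed in the text.

```python
from math import isclose

sums = [180, 183, 266, 205, 148, 181, 280]   # group sums, table 3.6
sizes = [6, 5, 9, 7, 6, 7, 9]                # group sizes, table 3.6


def is_comparison(coeffs, sizes):
    """Condition (14): the size-weighted coefficients sum to zero."""
    return isclose(sum(n * l for n, l in zip(sizes, coeffs)), 0.0, abs_tol=1e-12)


def is_orthogonal(c1, c2, sizes):
    """Size-weighted products of corresponding coefficients sum to zero."""
    return isclose(sum(n * a * b for n, a, b in zip(sizes, c1, c2)), 0.0, abs_tol=1e-12)


def comparison_ss(coeffs, sums, sizes):
    """Sum of squares c^2 / sum(n_i * lambda_i^2) for unequal groups."""
    c = sum(l * s for l, s in zip(coeffs, sums))
    return c * c / sum(n * l * l for n, l in zip(sizes, coeffs))


# Hypothetical comparison A: mean of group 2 against the pooled mean of the rest.
rest_size = sum(sizes) - sizes[1]
comp_a = [1 / sizes[1] if i == 1 else -1 / rest_size for i in range(len(sizes))]
# Hypothetical comparison B: mean of group 1 against mean of group 3.
comp_b = [1 / sizes[0], 0, -1 / sizes[2], 0, 0, 0, 0]

print(is_comparison(comp_a, sizes), is_comparison(comp_b, sizes))
print(is_orthogonal(comp_a, comp_b, sizes))

# Cross-check comparison A against the direct "two sets" calculation.
rest_sum = sum(sums) - sums[1]
direct = sums[1] ** 2 / sizes[1] + rest_sum ** 2 / rest_size - sum(sums) ** 2 / sum(sizes)
print(round(comparison_ss(comp_a, sums, sizes), 2), round(direct, 2))
```

Writing each coefficient as a multiple of $1/n_i$ is one convenient way of meeting condition (14), since the comparison is then a difference between (pooled) group means.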


3.6 Testing for homogeneity of variance 


At the end of Chapter 2 we mentio; 
geneity of variance may be investigat 
The test will now be described fro 
may note that the group 
Stroup 3 is seen to have a 


ned that a suspected lack of homo- 
ed by a test devised by Bartlett (1937) 
m data set out in table 3.6 (p. 59). W' 
variances differ appreciably. Thus, for errr 
Tange of scores twice that of group 2, and i 
Statistical table 3 Distribution of $\chi^2$*

Degrees     Probability that the value of $\chi^2$ shown in the body of the
of          table will be exceeded in random sampling
freedom
             0.90    0.70    0.50    0.30    0.10    0.05    0.02    0.01

  1         0.016    0.15    0.46    1.07    2.71    3.84    5.41    6.64
  2          0.21    0.71    1.39    2.41    4.60    5.99    7.82    9.21
  3          0.58    1.42    2.37    3.66    6.25    7.82    9.84   11.34
  4          1.06    2.19    3.36    4.88    7.78    9.49   11.67   13.28
  5          1.61    3.00    4.35    6.06    9.24   11.07   13.39   15.09
  6          2.20    3.83    5.35    7.23   10.64   12.59   15.03   16.81
  7          2.83    4.67    6.35    8.38   12.02   14.07   16.62   18.48
  8          3.49    5.53    7.34    9.52   13.36   15.51   18.17   20.09
  9          4.17    6.39    8.34   10.66   14.68   16.92   19.68   21.67
 10          4.86    7.27    9.34   11.78   15.99   18.31   21.16   23.21
 11          5.58    8.15   10.34   12.90   17.28   19.68   22.62   24.72
 12          6.30    9.03   11.34   14.01   18.55   21.03   24.05   26.22
 13          7.04    9.93   12.34   15.12   19.81   22.36   25.47   27.69
 14          7.79   10.82   13.34   16.22   21.06   23.68   26.87   29.14
 15          8.55   11.72   14.34   17.32   22.31   25.00   28.26   30.58
 16          9.31   12.62   15.34   18.42   23.54   26.30   29.63   32.00
 17         10.08   13.53   16.34   19.51   24.77   27.59   31.00   33.41
 18         10.86   14.44   17.34   20.60   25.99   28.87   32.35   34.80
 19         11.65   15.35   18.34   21.69   27.20   30.14   33.69   36.19
 20         12.44   16.27   19.34   22.78   28.41   31.41   35.02   37.57
 21         13.24   17.18   20.34   23.86   29.62   32.67   36.34   38.93
 22         14.04   18.10   21.34   24.94   30.81   33.92   37.66   40.29
 23         14.85   19.02   22.34   26.02   32.01   35.17   38.97   41.64
 24         15.66   19.94   23.34   27.10   33.20   36.42   40.27   42.98
 25         16.47   20.87   24.34   28.17   34.38   37.65   41.57   44.31
 26         17.29   21.79   25.34   29.25   35.56   38.88   42.86   45.64
 27         18.11   22.72   26.34   30.32   36.74   40.11   44.14   46.96
 28         18.94   23.65   27.34   31.39   37.92   41.34   45.42   48.28
 29         19.77   24.58   28.34   32.46   39.09   42.56   46.69   49.59
 30         20.60   25.51   29.34   33.53   40.26   43.77   47.96   50.89

* Adapted from table 4 of Fisher, R. A. and Yates, F., Statistical Tables for
Biological, Agricultural and Medical Research, Oliver & Boyd, 6th edition 1963,
by permission of the authors and publishers.
variance is actually more than three times that of group 2. The calculations
are set out in table 3.9.


The sums of squares in the first column of the table are the within-groups
sums, and their total is the within-groups sum previously obtained
(table 3.7). Similarly, the overall within-groups mean square, $S^2$, is the
same as that previously obtained. The last two columns involve the
logarithms of the separate group mean squares (i.e. the sum of squares
divided by the degrees of freedom). The factor 2.3026 in the calculation of
$\chi^2$ is necessary because common logarithms are used.

The resulting $\chi^2$ when referred to statistical table 3 with 7 degrees of
freedom shows a probability of over 0.70, i.e. it would arise by chance, on


... Accounts of such procedures are

... and Snedecor (1956, pp. 287-9).
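
The steps of the homogeneity test described in this section can be sketched in a few lines of Python. This is an illustration only: the function name, the layout of its argument (a list of groups, each a list of scores) and the scores in the check that follows are assumptions and are not the data of table 3.6. The sketch follows the formulae as given above, with common logarithms and the factor 2.3026.

```python
import math


def bartlett_chi_square(groups):
    """Return (chi_square, correction_factor, corrected_chi_square, df).

    `groups` is a list of lists of scores (hypothetical layout).
    Each group must contain at least two scores and show some variation.
    """
    k = len(groups)
    dfs = []        # degrees of freedom, n_i - 1, for each group
    sums_sq = []    # within-group sums of squares
    for scores in groups:
        mean = sum(scores) / len(scores)
        sums_sq.append(sum((x - mean) ** 2 for x in scores))
        dfs.append(len(scores) - 1)

    total_df = sum(dfs)
    pooled_ms = sum(sums_sq) / total_df                 # overall mean square, S^2
    group_ms = [ss / df for ss, df in zip(sums_sq, dfs)]

    # 2.3026 converts common logarithms to natural logarithms.
    chi_square = 2.3026 * (
        total_df * math.log10(pooled_ms)
        - sum(df * math.log10(ms) for df, ms in zip(dfs, group_ms))
    )
    correction = 1 + (sum(1 / df for df in dfs) - 1 / total_df) / (3 * (k - 1))
    return chi_square, correction, chi_square / correction, k - 1


# Hypothetical scores for three groups of unequal size.
example = [[12, 15, 11, 14, 13], [10, 18, 7, 16, 9, 12], [13, 13, 14, 12]]
chi_sq, c, corrected, df = bartlett_chi_square(example)
print(round(chi_sq, 3), round(c, 4), round(corrected, 3), df)
```

The corrected value would then be referred to statistical table 3 with the number of degrees of freedom returned as the last item.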
Chapter 4 Nesting Designs


4.1 Introduction 


her. Suppose, for instance, 
pparatus is as a teaching aid 
for example, the Dienes Multibase 


We would therefore select:
1. A number of primary schools, possibly a random sample of such
schools in a given region, though this is not essential. (Thus it might
be better deliberately to select schools from contrasting social areas,
or schools with a different internal organization, e.g. streamed and
unstreamed schools.)
... one (pupils) being nested within the other (teachers). While it
is not essential to select equal numbers of pupils for each teacher,
and equal numbers of teachers for each school, this arrangement is to be
preferred in that it provides the most economical use of resources. An
experiment designed along these lines will now be described.


4.2 An experiment with two levels of sampling 


Suppose that an investigation into the suitability of a new apparatus
as a teaching aid is conducted in four schools, that three teachers are
selected at random from each of the schools, and that each teacher uses the
apparatus with a random group of six pupils. This implies that, within
each school, the teachers selected would have to be allocated at random to
the (randomly selected) pupil groups. We will suppose, too, that after a
series of lessons the attainments of all the pupils on a suitable test
are as shown in table 4.1.

The partitioning of the sum of squares will now take account of the
differences between schools, and of the differences between teachers within
schools. The calculation is as follows:

1. Overall sum of scores, $\sum x = 630 + 754 + 697 + 654 = 2735$
   (the overall mean is therefore $\bar{x} = 37.99$)

2. Correction term (overall), $\dfrac{(\sum x)^2}{N} = \dfrac{2735 \times 2735}{72} = 103{,}892.01$

3. Total sum of squares $= (44^2 + \dots + 29^2)$ [sum of 72 terms] $- 103{,}892.01$
   $= 105{,}637.00 - 103{,}892.01$
   $= 1{,}744.99$

4. Between-schools sum of squares $= \dfrac{630^2}{18} + \dfrac{754^2}{18} + \dfrac{697^2}{18} + \dfrac{654^2}{18} - 103{,}892.01$
   $= 104{,}385.61 - 103{,}892.01$
   $= 493.60$

5. Between-teachers, within-schools sum of squares
   $= \left(\dfrac{227^2}{6} + \dfrac{210^2}{6} + \dfrac{193^2}{6} - \dfrac{630^2}{18}\right)$ + similar terms for schools II, III and IV
   $= 104{,}589.16 - 104{,}385.61$
   $= 203.55$

6. Within-groups sum of squares $= 1{,}744.99 - 493.60 - 203.55$
   $= 1{,}047.84$

As before, the total sum of squares expresses the variation of all the
scores without taking account of their separation into groups. It is the
sum of the squares of all the scores expressed as deviations from the overall
mean (37.99).

The between-schools sum (493.60) expresses only the variation between
schools, i.e. the variation that would occur if all the scores were replaced
by the mean score of the school to which they belong. It takes no account
of the variation between teachers, or of that of individual pupils.

The between-teachers, within-schools sum of squares (203.55) gives a
measure of the variation between teachers that results from isolating the
teacher differences within each school, and then summing for all schools.
Note that the (overall) correction term is not now used. Instead we have
separate correction terms for each school ($\frac{630^2}{18}$, $\frac{754^2}{18}$, etc.). The
resulting sum is essentially the sum of the squares of all the teacher mean
scores expressed as deviations from their own school mean, e.g.
$(37.83 - 35.00)^2$ for the mean score of teacher 1 of school I, increased
sixfold (since there are six pupils in each group).

The within-groups sum of squares (1,047.84) obtained above by
subtraction could also be described as the sum of squares between pupils
within teachers (or groups), and could also be obtained as the sum of the
squares of all the scores expressed as deviations from their own group
mean (see p. 36).
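
For readers who prefer to see the partitioning as a procedure, the following Python sketch computes the three sums of squares for any two-level nesting design. It is a sketch under stated assumptions: the function name and the nested-list layout (`data[i][j]` holding the scores of teacher group j in school i) are hypothetical, and the small data set in the check is invented, the scores of table 4.1 not being reproduced here.

```python
def nested_sums_of_squares(data):
    """Partition the total sum of squares for a two-level nesting design.

    data[i][j] is the list of pupil scores for teacher group j in school i.
    """
    all_scores = [x for school in data for group in school for x in group]
    correction = sum(all_scores) ** 2 / len(all_scores)   # overall correction term

    total_ss = sum(x * x for x in all_scores) - correction

    # Sum over schools of (school total)^2 / (number of scores in the school).
    school_term = sum(
        sum(sum(group) for group in school) ** 2
        / sum(len(group) for group in school)
        for school in data
    )
    # Sum over teacher groups of (group total)^2 / (number of scores in the group).
    group_term = sum(
        sum(group) ** 2 / len(group) for school in data for group in school
    )

    between_schools = school_term - correction
    teachers_within_schools = group_term - school_term   # school correction terms
    within_groups = total_ss - between_schools - teachers_within_schools
    return {
        "total": total_ss,
        "schools": between_schools,
        "teachers_within_schools": teachers_within_schools,
        "within_groups": within_groups,
    }


# Invented scores: two schools, two teachers in each, three pupils per teacher.
toy = [[[4, 5, 6], [7, 8, 9]], [[1, 2, 3], [4, 6, 8]]]
print(nested_sums_of_squares(toy))
```

The within-groups figure obtained by subtraction agrees with the sum of squared deviations of the scores from their own group means, as the text remarks.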
These sums of squares are set down in table 4.2, together with the
corresponding degrees of freedom and mean squares. The degrees of
freedom for the variation between teachers within schools are 8, there
being 2 for each of the four schools.


Table 4.2 Analysis of variance of the data in table 4.1

                              Sum of      Degrees
Source of variation           squares     of freedom    Mean square

Schools (S)                    493.60          3           164.53
Teachers (T), within S         203.55          8            25.44
Pupils (P), within T         1,047.84         60            17.46

Total                        1,744.99         71


From table 4.2 we see that the ratio of the mean square for teachers to
that for pupils (or within groups) is obtained as $F = \dfrac{25.44}{17.46} = 1.46$.
This is less than the reading in statistical table 2A for 8 and 60 degrees of
freedom (2.10). The differences a...
4.3 The model


The general model for the breakdown of a single score in the experiment
described would be as follows. The score $X_{ijk}$ of the kth pupil in the jth
teacher group in the ith school is assumed to consist of four independent
components given by

$$X_{ijk} = M + S_i + T_{ij} + e_{ijk} \qquad (16)$$

where $M$ is a component common to all the scores;
$S_i$ is a component common to all scores in school i;
$T_{ij}$ is a component common to all scores in the teacher group j within
school i;
and $e_{ijk}$ is a component specific to pupil k of teacher group j within
school i.

In the particular experiment described, i would run from 1 to 4, j from 1 to
3 and k from 1 to 6.

We may put $S_i$ equal to the population mean of the scores for school i
expressed as a deviation from the mean of the population means for all
the schools. In other words, summing for all the schools, $\sum_i S_i = 0$
(see p. 55). As before, we may also put $M$ equal to the mean of the
population means for all the schools. Thus, in table 4.1 the overall mean
of the seventy-two scores, 37.99, estimates $M$, 35.00 - 37.99 = -2.99
estimates $S_1$, 41.89 - 37.99 = 3.90 estimates $S_2$, 38.72 - 37.99 = 0.73
estimates $S_3$, and 36.33 - 37.99 = -1.66 estimates $S_4$.

Similarly, we may put $T_{ij}$ equal to the population mean of the scores of
teacher group j within school i (i.e. the mean of all the scores that would
result if the population had all been taught by this particular teacher in
this particular school) and express this as a deviation from the population
mean of all the scores for school i (i.e. as a deviation from $M + S_i$).
Summing for all the teachers in the school, we therefore have
$\sum_j T_{ij} = 0$, and for each i in turn. In table 4.1 37.83 - 35.00 = 2.83
estimates $T_{11}$, 35.00 - 35.00 = 0.00 estimates $T_{12}$, and so on.

We also assume that for any one school the population means for the
different teacher groups themselves form part of a normally distributed
population with a variance which is the same as that for each of the other
schools in turn. We shall denote this common population variance of
teacher group means by $\sigma_t^2$.
Finally, since $e_{ijk}$ is the specific contribution of a pupil in one teacher
group, and since $M + S_i + T_{ij}$ already gives the population mean of the
scores for that group, it follows that, summing with respect to k for the
whole population, $\sum_k e_{ijk} = 0$. It follows that the mean of all the
$e_{ijk}$s is also zero. We assume, too, that for any given i and j, $e_{ijk}$ is
from a normally distributed population, with a variance which is the same
for all such populations, i.e. for all values of i and j in turn. As before
we shall denote this common population variance by $\sigma^2$.


Table 4.3 Components analysis for a nesting design

                              Degrees          Mean-square
Source of variation           of freedom*      expectation

Schools (S)                   $s-1$            $\sigma^2 + n\sigma_t^2 + tn\sigma_s^2$
Teachers (T), within S        $s(t-1)$         $\sigma^2 + n\sigma_t^2$
Pupils (P), within T          $st(n-1)$        $\sigma^2$

Total                         $stn-1$

* These are written for s schools, with t teachers within each school, and
with n pupils for each teacher.


In summary, then, we may simply say that the model supposes $e_{ijk}$ to
be drawn from a normally distributed population with a mean of zero and ...

... appears in the mean-square expectation for schools ... variance of the
population of schools of which the particular schools chosen may be
considered a random sample. (Alternatively, this component would be
written as $tn\dfrac{\sum S_i^2}{s-1}$, s being the number of schools, if
the schools effect is to be regarded as fixed; see p. 57.) We note, therefore,
that three components enter into the mean-square expectation for
schools, one component for each of the independent sources of variation,
pupils, teachers and schools. This must be so, since the obtained differences
between schools would obviously be affected by any real differences
between the schools themselves and any real differences between the
teachers within the schools, as well as by differences between the pupils
within the teaching groups.
From table 4.3 it is obvious that the appropriate error term for testing
the significance of school differences (i.e. $\sigma_s^2$ not being zero) is the
mean square for teachers. This is because

$$\frac{\text{Mean square for schools}}{\text{Mean square for teachers}} \quad\text{estimates}\quad \frac{\sigma^2 + n\sigma_t^2 + tn\sigma_s^2}{\sigma^2 + n\sigma_t^2}$$

i.e. a quantity greater than unity only if $\sigma_s^2$ is non-zero. In the same
way teacher differences are tested by the mean square for pupils, since

$$\frac{\text{Mean square for teachers}}{\text{Mean square for pupils}} \quad\text{estimates}\quad \frac{\sigma^2 + n\sigma_t^2}{\sigma^2}$$

which is greater than unity only if $\sigma_t^2$ is non-zero.

Clearly we must always select for our error term the mean square
estimating all components in the mean-square expectation of the differences
tested except one: the additional component involving the specific effect
(schools, teachers, etc.) being tested.
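
The expectations of table 4.3 also suggest how rough estimates of the separate components might be obtained from the mean squares of table 4.2, by equating each observed mean square to its expectation. The text does not carry out this step, so the Python sketch below should be read as an illustration under that assumption; the function and argument names are hypothetical.

```python
def nested_f_ratios_and_components(ms_schools, ms_teachers, ms_pupils, t, n):
    """F ratios and component estimates for the random-effects nesting model.

    t = teachers per school, n = pupils per teacher.
    Component estimates come from equating observed mean squares to the
    expectations of table 4.3 (an assumption of this sketch); an estimate
    may turn out negative in small samples.
    """
    return {
        "F_schools": ms_schools / ms_teachers,     # error term: teachers
        "F_teachers": ms_teachers / ms_pupils,     # error term: pupils
        "var_pupils": ms_pupils,
        "var_teachers": (ms_teachers - ms_pupils) / n,
        "var_schools": (ms_schools - ms_teachers) / (t * n),
    }


# Mean squares of table 4.2, with three teachers per school and six pupils each.
estimates = nested_f_ratios_and_components(164.53, 25.44, 17.46, t=3, n=6)
for name, value in estimates.items():
    print(name, round(value, 2))
```

With the figures of table 4.2 the ratio for teachers is the 1.46 already obtained, and the ratio for schools is formed against the mean square for teachers, as the components analysis requires.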


4.4 Modification for a model with fixed effects


The model just described is, of course, one with random effects, both 
teachers and pupils being randomly selected. Suppose, however, that our 
interest is not with teacher differences but with method differences, and 
that in each of the schools concerned the same three methods were used. 
We could, for instance, eliminate the distinctive influence of particular 
teachers by programming the different methods, the human teachers being 
replaced by, say, linear programmes, so that for any one method the same 
sequence of operations would be covered in all the schools. The comparison 
of teachers would now be replaced by a comparison of methods, but the 
methods—unlike the teachers—would not be subject to sampling. They 
would give rise to fixed effects. The model now having both random effects 
(for pupils) and fixed effects (for methods) would be described as a mixed
model, and the components analysis would be as shown in table 4.4.


Table 4.4 Components analysis for the nesting design
of table 4.3, with methods replacing teachers

                              Degrees          Mean-square
Source of variation           of freedom*      expectation

Schools (S)                   $s-1$            $\sigma^2 + mn\sigma_s^2$
Methods (M), within S         $s(m-1)$         $\sigma^2 + n\sigma_m^2$
Pupils (P), within M          $sm(n-1)$        $\sigma^2$

Total                         $smn-1$

* These are written for s schools, with m methods within each school, and
n pupils in each method group within each school.


Table 4.4 corresponds to table 4.3, with M (for methods) replacing T
(for teachers), and hence $\sigma_m^2$, the population variance of the method
means, replacing $\sigma_t^2$, and with m (number of methods within each
school) replacing t. The mean-square expectation for methods therefore
becomes $\sigma^2 + n\sigma_m^2$. The mean-square expectation for schools,
however, is not $\sigma^2 + n\sigma_m^2 + mn\sigma_s^2$ but simply
$\sigma^2 + mn\sigma_s^2$. This is because M is a fixed
effect, and differences in methods (which are precisely the same for each
of the schools) do not contribute to the obtained differences between
schools. These differences must now be tested for significance against the
mean square for pupils, since

$$\frac{\text{Mean square for schools}}{\text{Mean square for pupils}} \quad\text{estimates}\quad \frac{\sigma^2 + mn\sigma_s^2}{\sigma^2}$$

which is greater than unity only if $\sigma_s^2$ is non-zero. (Thus, the test for
the data of table 4.1 would give $F = \dfrac{164.53}{17.46} = 9.42$, so showing
differences between schools to be significant at the 1-per-cent level, a
higher level of significance than before.) The difference from the preceding
situation aptly illustrates that the choice of error term should always be
dictated by a prior components analysis, not by any rule of thumb. This
will become even more apparent as the discussion proceeds.
4.5 A nesting hierarchy


Nesting need not be confined to two levels: it could be extended indefinitely.
Thus, to develop our present illustration, we could begin by contrasting
two or more types of school (e.g. comprehensive-tripartite, streamed-
unstreamed), or two or more regions in the country, select schools in each
of these types or regions, select teachers in the schools, and finally select
pupils for each of the teacher groups chosen. We would then have a
hierarchy with three levels, with pupils nested within teacher groups,
teachers within schools and schools within types (or regions). We could
also extend the hierarchy downwards by administering more than one test
to the pupils (an objective test and an essay-type test, say), or again
administer the same or a parallel test on a second occasion, and so
consider also the differences between tests, or between occasions, within
pupils. A numerical illustration of this is given later.


Table 4.5 Components analysis for a nesting hierarchy, all effects random

                              Degrees
Source of variation           of freedom       Mean-square expectation

A                             $a-1$            $\sigma^2 + n\sigma_d^2 + dn\sigma_c^2 + cdn\sigma_b^2 + bcdn\sigma_a^2$
B, within A                   $a(b-1)$         $\sigma^2 + n\sigma_d^2 + dn\sigma_c^2 + cdn\sigma_b^2$
C, within B                   $ab(c-1)$        $\sigma^2 + n\sigma_d^2 + dn\sigma_c^2$
D, within C                   $abc(d-1)$       $\sigma^2 + n\sigma_d^2$
Individuals, within D         $abcd(n-1)$      $\sigma^2$

Suppose, then, that in a nesting hierarchy A refers to the uppermost
level, and that there are a groups of A; that B refers to the second level,
and that there are b groups of B for each of the a groups of A; that C refers
to the third level, and that there are c groups of C for each of the ab groups
of B; that D refers to the fourth level, and that there are d groups of D for
each of the abc groups of C. (We could obviously continue the process
indefinitely with levels E, F, G, etc., but the general argument should now
be apparent.) With the same notation as before (equation 16, p. 75) the
model would be as follows:

$$X_{ijklm} = M + A_i + B_{ij} + C_{ijk} + D_{ijkl} + e_{ijklm} \qquad (17)$$

All five contributions to the score $X_{ijklm}$ would be independent of each
other, and the As, Bs, Cs, Ds and es would all be drawn from normally
distributed populations with means of zero and variances of $\sigma_a^2$,
$\sigma_b^2$, $\sigma_c^2$, $\sigma_d^2$ and $\sigma^2$ respectively. Also, with n
individuals in each of the ultimate (D) groups, and with all the A, B, C and D
effects being random, the components analysis would be as shown in table 4.5.

A differences would then be tested for significance against the mean
square for B; B differences against the mean square for C; C differences
against the mean square for D; and D differences against the mean square
for individuals.

Suppose, however, that one of the effects, C say, is fixed. This implies
that differences arising from C are precisely the same (i.e. not subject to
sampling) for each of the ab groups of the higher-level classifications
(A and B), and that $\sigma_c^2$ (which might also be written as
$\dfrac{\sum C^2}{c-1}$ in the third row of table 4.5, there being now no
population for the Cs) does not contribute at all to the variation of A and B.
The term $dn\sigma_c^2$ must, therefore, be deleted from the first two rows of
table 4.5, and the mean-square expectations would then be as shown in
table 4.6.

The B differences would now be tested for significance against the mean
square for D, not C, the A, C and D differences being tested for significance
in the same way as before. The reader may easily verify that if, in addition,
the D effect is fixed, both the B and C differences would then be tested for
significance against the mean square for individuals.
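
The cases just worked through follow a simple pattern for a pure nesting hierarchy: the error term for a given level is the mean square of the nearest lower level whose effect is random. The Python sketch below merely encodes that pattern as read off from tables 4.5 and 4.6; it is not a substitute for a proper components analysis, and the function name, the list of level labels and the assumption that the lowest level (individuals) is always random are all introduced here for illustration.

```python
def error_term(levels, fixed, tested):
    """Return the level whose mean square serves as error term for `tested`.

    levels: labels ordered from the uppermost level down to individuals.
    fixed:  set of labels whose effects are fixed (individuals assumed random).
    A fixed effect below the tested level drops out of its mean-square
    expectation, so it is skipped in the search.
    """
    position = levels.index(tested)
    for lower in levels[position + 1:]:
        if lower not in fixed:
            return lower
    raise ValueError("no random level below " + tested)


hierarchy = ["A", "B", "C", "D", "individuals"]

# All effects random (table 4.5).
print([error_term(hierarchy, set(), lv) for lv in "ABCD"])

# C fixed (table 4.6): B is now tested against D.
print([error_term(hierarchy, {"C"}, lv) for lv in "ABCD"])

# C and D both fixed: B and C are tested against individuals.
print([error_term(hierarchy, {"C", "D"}, lv) for lv in "ABCD"])
```

Applied to the schools-methods-pupils design of section 4.4 with methods fixed, the same rule gives pupils as the error term for schools, in agreement with table 4.4.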


4.6 An experiment with three levels of sampling 


A convenient illustration of a three-level nesting hierarchy can be
developed from the experiment described in section 4.2. Thus, suppose that
having selected the schools, and having sampled teachers within the schools
as before, we now decide to test the effectiveness of the new teaching
apparatus both immediately after the series of lessons and after an interval
of, say, a week. Our intention, in other words, is to test retention as well as
immediate learning. We express this by saying that we are sampling
occasions within teachers, by arranging for each teacher to teach two
groups of pupils, one of the groups taking an immediate test (at the
conclusion of the teaching), occasion i, and the other a delayed test,
occasion ii. We finally select random groups of pupils for each occasion.
To illustrate the partitioning of the total sum of squares, we will suppose
that the scores shown in table 4.7 have been obtained. All the occasion i
scores have, in fact, been reproduced from table 4.1.


Table 4.6 Components analysis for the nesting hierarchy of table 4.5,
with the C effect fixed

                              Degrees
Source of variation           of freedom       Mean-square expectation

A                             $a-1$            $\sigma^2 + n\sigma_d^2 + cdn\sigma_b^2 + bcdn\sigma_a^2$
B, within A                   $a(b-1)$         $\sigma^2 + n\sigma_d^2 + cdn\sigma_b^2$
C, within B                   $ab(c-1)$        $\sigma^2 + n\sigma_d^2 + dn\sigma_c^2$
D, within C                   $abc(d-1)$       $\sigma^2 + n\sigma_d^2$
Individuals, within D         $abcd(n-1)$      $\sigma^2$

Total                         $abcdn-1$


We must now take account of differences between schools, differences
between teachers within schools, differences between occasions within
teachers, as well as the basic differences between pupils (or differences
between pupils within occasions) as the ultimate within-groups sum of
squares. The calculation is as follows:

1. Overall sum of scores, $\sum x = 1243 + 1490 + 1363 + 1269 = 5365$

2. Correction term (overall), $\dfrac{(\sum x)^2}{N} = \dfrac{5365 \times 5365}{144} = 199{,}883.50$
Table 4.7 Scores of two groups of pupils from each of three teachers in
each of four schools, the groups being tested on different occasions

[Individual scores not recoverable. Layout: Schools I and II; Teachers 1 to 3
within each school; Occasions i and ii within each teacher.]

SCHOOL I             Teacher 1       Teacher 2       Teacher 3
Occasion              i     ii        i     ii        i     ii
Totals               227   217       210   213       193   183

3. Total sum of squares $= (44^2 + 41^2 + 39^2 + \dots + 28^2)$ [sum of 144 terms] $- 199{,}883.50$
   $= 204{,}127 - 199{,}883.50$
   $= 4{,}243.50$

4. Between-schools sum of squares $= \dfrac{1243^2}{36} + \dfrac{1490^2}{36} + \dfrac{1363^2}{36} + \dfrac{1269^2}{36} - 199{,}883.50$
   $= 200{,}924.41 - 199{,}883.50$
   $= 1{,}040.91$
Table 4.7 (continued)

[Individual scores for Schools III and IV not recoverable. Layout as before:
Teachers 1 to 3 within each school; Occasions i and ii within each teacher.]

5. Between-teachers, within-schools sum of squares
   $= \left(\dfrac{444^2}{12} + \dfrac{423^2}{12} + \dfrac{376^2}{12} - \dfrac{1243^2}{36}\right)$ + similar terms for schools II, III and IV
   $= 201{,}422.41 - 200{,}924.41$
   $= 498.00$

6. Between-occasions, within-teachers sum of squares
   $= \left(\dfrac{227^2}{6} + \dfrac{217^2}{6} - \dfrac{444^2}{12}\right)$ + similar terms for the other 11 teachers
   $= 201{,}561.12 - 201{,}422.41$
   $= 138.71$

7. Within-groups sum of squares $= 4{,}243.50 - 1{,}040.91 - 498.00 - 138.71$
   $= 2{,}565.88$
The sums of squares for between schools and for between teachers
within schools are obtained in the same way as before. The sum of squares
between occasions results from a precisely similar process, using a separate
correction term for each teacher's pair of occasions ($\frac{444^2}{12}$,
$\frac{423^2}{12}$, etc.) to isolate the occasion differences within each
teacher.

The analysis is set down in table 4.8. As before the degrees of freedom
for schools and teachers are 3 and 8 respectively. There are 12 degrees
of freedom for occasions, 1 for each of the twelve teachers, and
143 - 3 - 8 - 12 = 120 for pupils (this number also resulting from there
being 5 degrees of freedom within each of the twenty-four groups).


Table 4.8 Analysis of variance of the data in table 4.7

                              Sum of      Degrees
Source of variation           squares     of freedom    Mean square

Schools (S)                  1,040.91          3           346.97
Teachers (T), within S         498.00          8            62.25
Occasions (O), within T        138.71         12            11.56
Pupils (P), within O         2,565.88        120            21.38

Total                        4,243.50        143


The model would now be written with the same notation as before
(equation 16, p. 75) as

$$X_{ijkl} = M + S_i + T_{ij} + O_{ijk} + e_{ijkl} \qquad (18)$$
$X_{ijkl}$ being the score of the lth pupil in the kth occasion within the jth
teacher of the ith school, and where the components $M$, $S_i$ and $T_{ij}$ are
the same as before, where $O_{ijk}$ is the component common to all scores of
occasion k within teacher j within school i, and where $e_{ijkl}$ is the
component specific to pupil l of occasion k within teacher j within school i.
(i runs from 1 to 4, and j from 1 to 3 as before, but k now assumes only the
values 1 and 2 and l runs from 1 to 6.) The components analysis would
then be as shown in table 4.9, $\sigma_s^2$, $\sigma_t^2$, $\sigma_o^2$ and
$\sigma^2$ denoting the population variances of the respective components in
the same way as before. The occasions (O) effect is fixed (we have not
randomly selected the two testing occasions, but have selected them
deliberately to test immediate and retained learning) so that the term
$n\sigma_o^2$ does not appear in the mean-square expectations for teachers
and schools.


Table 4.9 Components analysis for a schools-teachers-
occasions-pupils nesting hierarchy

                              Degrees          Mean-square
Source of variation           of freedom*      expectation

Schools (S)                   $s-1$            $\sigma^2 + on\sigma_t^2 + ton\sigma_s^2$
Teachers (T), within S        $s(t-1)$         $\sigma^2 + on\sigma_t^2$
Occasions (O), within T       $st(o-1)$        $\sigma^2 + n\sigma_o^2$
Pupils (P), within O          $sto(n-1)$       $\sigma^2$

Total                         $ston-1$

* These are written for s schools, with t teachers in each school, o occasions
for each teacher and n pupils within each occasion.


School differences would then be tested for significance as before
against the mean square for teachers, but both teacher and occasion
differences would now be tested against the mean square for pupils. From
table 4.8, therefore, we have for the school differences
$F = \dfrac{346.97}{62.25} = 5.57$, which for 3 and 8 degrees of freedom is
significant at the 5-per-cent level (statistical table 2A). For the teacher
differences we have $F = \dfrac{62.25}{21.38} = 2.91$, which for 8 and 120
degrees of freedom is significant at the 5-per-cent level (statistical
table 2A). Finally, for the occasion differences we have
$F = \dfrac{11.56}{21.38} < 1$, showing that the differences between
occasions are too small for significance at the 5-per-cent level. Indeed,
occasion differences of the magnitude shown in table 4.8 could be expected
more than half the time from chance. We may conclude that schools would
not be equally successful in using the new apparatus; that judged by the
total evidence available (and not that from occasion i alone) the apparatus
could not be used with equal success by all teachers in the schools; and
that whatever success the apparatus has may be taken to apply both to
immediate and retained learning.
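
The mean squares and F ratios just quoted can be reproduced from the sums of squares and degrees of freedom of table 4.8. The following Python sketch does so; the dictionary layout and names are illustrative only, and the pairing of each effect with its error term follows table 4.9 (occasions fixed).

```python
# (sum of squares, degrees of freedom) for each source in table 4.8.
table_4_8 = {
    "schools": (1040.91, 3),
    "teachers": (498.00, 8),
    "occasions": (138.71, 12),
    "pupils": (2565.88, 120),
}

mean_square = {source: ss / df for source, (ss, df) in table_4_8.items()}

# Error terms as dictated by the components analysis of table 4.9.
error_terms = {"schools": "teachers", "teachers": "pupils", "occasions": "pupils"}

f_ratio = {
    effect: mean_square[effect] / mean_square[error]
    for effect, error in error_terms.items()
}

for effect, error in error_terms.items():
    df_effect = table_4_8[effect][1]
    df_error = table_4_8[error][1]
    print(effect, round(f_ratio[effect], 2), "with", df_effect, "and", df_error, "d.f.")
```

Each ratio would then be referred to statistical table 2A with the pair of degrees of freedom printed beside it.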

One feature of the design deserves special note. The two groups 
selected for each of the teachers must be independent random samples. 
(Mathematically the set of specific components $e_{ijkl}$ for the different
values of l must be independent from the sets resulting from the previously
taken different values of k, and, for that matter, for the different values
of i and j, too.) It would not do for the groups to be matched in any way,
still less for the groups to consist of the same pupils tested twice. Yet from 
the viewpoint of administrative efficiency this would seem the obvious 
arrangement to make. Why should not all the pupils taught be tested on 
both occasions? Any disadvantage from memory effects could be removed 
by using not the same test but a parallel test on the second occasion. The 
answer is that this arrangement could well be adopted, and would in fact 
have a decided advantage over using different pupils for the two occasions, 
but that if it were adopted the mode of analysis just described would be 
inadequate. It would be inadequate because the partitioning of the total 
variation has taken no account of the differences between pupils across 
occasions. 

Suppose, for instance, that within each of the teacher blocks of table 4.7 
the two scores in each row, under occasions i and ii, belong to the same 
pupil. These scores could then be totalled to give a pupil total, in the same 
way as the scores in each column have been totalled to give an occasion 
total, and from which a sum of squares for the differences between 
pupils within teachers could be derived corresponding to the sum of squares 
between occasions within teachers already obtained. Indeed, we could then 
also view the design as one with pupils (not occasions) nested within 
teachers, and occasions within pupils! A full account of a design of this 
type is given in chapter 5. 
Chapter 5 Crossing Designs: Randomized Blocks


5.1 Introduction 


In almost all research involving school pupils, college or university 
students, and indeed human beings generally, the one outstanding feature 
found is the very considerable extent of individual variation. This renders 
the randomized-groups design of little value unless we are able to offset 
the effect of individual differences by having very large groups. It is usually 
more economical to seek other ways of combating individual variation. 
One such way is afforded by the randomized-blocks design. 
The randomized-blocks design is based on the principle of grouping
into blocks persons who are expected to respond to the treatment of the 
experiment in a broadly similar manner—more similar, at any rate, than 
would be the case from a completely random group. By taking account of 
differences between the blocks, and removing these from the usual within- 
groups variation, a smaller error term—one based solely on differences 
within the blocks—will be obtained. 
The term ‘block’ derives from experiments in agriculture, where, for 
example, differences in the effectiveness of various fertilizers or other soil 
treatments are investigated. The field available for the experiment cannot 
be expected to be completely uniform. Some parts would be more fertile 
than others. Blocks are therefore marked out in such a way as to control 
differences in the natural fertility of the soil. Plots are then marked out in 
each of the blocks—the number of plots being equal to the number of 
treatments being investigated—and each treatment allocated at random 
to one of the plots in every block (see figure 4). In this manner it is hoped 
that a major portion of the differences in natural soil fertility—a factor 
extraneous to the experiment—will be apportioned to the differences 
between blocks, leaving a much reduced portion for the differences within
blocks.

In educational experiments concerned with human learning the most 
obvious measure on which to form blocks is the person's intelligence, or,
alternatively, his previous attainment in the type of task being studied. 
The variation in performance among school pupils all between I.Q. 110 
and 120 in a methods experiment, for instance, could be expected to be far 
less than that among a group selected without regard to intelligence. 
Comparisons between two or more methods groups would then be more 
likely to show real differences.* In other experiments appropriate measures 
on which to form blocks might be the score on a suitable pre-test, socio- 
economic status, sex, age and, in the non-cognitive field, such traits as 
extraversion, industriousness and emotional stability. The important 
consideration is that the persons within the same block should be more 
alike with respect to their probable response to the treatment than a 
completely random group. There is no point in forming blocks on the 
basis of something unrelated to the ability or trait being investigated. 

We should note, too, that although the choice of persons for each plot 
(or treatment subgroup) is restricted by conformity to the block, random 
selection within the bounds of this restriction is still necessary. The 
randomized-blocks design may in fact be regarded as embodying a series 
of randomized-groups experiments, one such experiment for each of the 
blocks. 
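
The random selection within blocks described above might be carried out as in the following Python sketch. It is offered only as an illustration: the function name, the dictionary of blocks and the pupil labels are hypothetical, and equal group sizes within a block are assumed (any pupils left over after equal division are simply not allocated).

```python
import random


def allocate_within_blocks(blocks, treatments, seed=None):
    """Randomly allocate the persons of each block to the treatments.

    blocks:     dict mapping a block label to a list of persons.
    treatments: list of treatment labels (the 'plots' of each block).
    Each block is shuffled separately, so that the design amounts to one
    randomized-groups experiment per block.
    """
    rng = random.Random(seed)
    allocation = {}
    for label, persons in blocks.items():
        shuffled = list(persons)
        rng.shuffle(shuffled)
        size = len(shuffled) // len(treatments)   # equal groups within the block
        allocation[label] = {
            treatment: shuffled[i * size:(i + 1) * size]
            for i, treatment in enumerate(treatments)
        }
    return allocation


# Hypothetical pupils grouped into three I.Q. blocks.
pupil_blocks = {
    "I.Q. 90-99": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "I.Q. 100-109": ["p7", "p8", "p9", "p10", "p11", "p12"],
    "I.Q. 110-119": ["p13", "p14", "p15", "p16", "p17", "p18"],
}
plan = allocate_within_blocks(pupil_blocks, ["method 1", "method 2"], seed=1)
for block, groups in plan.items():
    print(block, groups)
```

Supplying a seed merely makes the allocation reproducible; omitting it gives a fresh random allocation on each run.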


5.2 A two-way experiment 


For an illustration of a randomized-blocks design we turn to a methods 
experiment in which pupils are randomly selected for two method groups 
within each of three I.Q. ranges (for example, I.Q.s 90-99, 100-109 and 
110-119). Intelligence, in other words, is the measure used to form the 
blocks. For convenience of exposition we shall label the three blocks 
‘superior’, ‘average’ and ‘inferior’. With two methods being compared—in 
the terminology of the agricultural field experiment, two plots within each 
of three blocks—there will be six pupil groups in all, and so the final test 
scores may be arranged in the cells of a 2 x 3 table, as shown in table 5.1. 
Generally it would be better to select equal numbers in each group. The 
analysis to be described, however, also permits the group numbers in any 
pair of rows or columns to be proportionate, i.e. for any two rows (or

* Methods groups so selected are described as 'matched' groups (matched
with respect to intelligence in this instance) and hence the randomized-blocks
design may also be described as a matched-groups design.
[Figure: field divided into four blocks (Block 1 to Block 4) laid out along
the fertility gradient of the field]

Figure 4 Field plan of a randomized-blocks experiment with five
treatments (A, B, C, D and E).
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columns) the numbers in each column (or row) must be in the same 
proportion throughout. This more general case is now taken, with the 
group numbers for methods 1 and 2 being 4, 8, 6 and 6, 12, 9 respectively. 


Table 5.1 Scores of six groups in a randomized-blocks design

                              Methods (A)
                            1            2           Totals
Levels (B)   Superior     120 (4)      205 (6)       325 (10)
             Average      198 (8)      359 (12)      557 (20)
             Inferior     135 (6)      189 (9)       324 (15)
Totals                    453 (18)     753 (27)     1206 (45)

N.B. Only the total score of each cell is shown. The number of scores
contributing to each total is shown in parentheses.


Each of the forty-five test scores can be classified in two ways, as
belonging to a particular method (A) and also to a particular block or
I.Q. level (B). Our interest, of course, is in the difference between methods
within each of the levels as well as how these differences themselves differ
from level to level. It is best to extract systematically from the total
variance the differences between levels (rows in table 5.1) as well as the
differences between methods (columns) in the manner already described
(chapter 3).

We begin, however, by separating the total variation into the differences
between cells (the scores of one method at any level being grouped into
one of the six cells of table 5.1), without taking account of the twofold
classification of these cells, and the differences within cells. Differences
between cells are then analysed into components expressing the differences
due respectively to (a) methods, (b) levels, and (c) a remainder, a component
which is termed the interaction between methods and levels. The complete
partitioning is shown in figure 5.


                       Total variation
                      /               \
             Between cells         Within cells
            /      |       \
    Between     Between     Interaction
  methods (A)  levels (B)     (A × B)

Figure 5 Breakdown of variation in a two-way methods experiment


The calculation is as follows:

1. Total sum of squares (the sum $\sum x^2$ running over the squares of all
45 scores)

\[ \sum x^2 - \frac{1206^2}{45} = 33{,}776 - 32{,}320.80 = 1{,}455.20 \]

2. Between-cells sum of squares (the bracket holding a sum of 6 terms, one
for each cell)

\[ \left( \frac{120^2}{4} + \frac{205^2}{6} + \cdots + \frac{189^2}{9} \right) - \frac{1206^2}{45} = 33{,}251.25 - 32{,}320.80 = 930.45 \]

3. Within-cells sum of squares

\[ 1{,}455.20 - 930.45 = 524.75 \]

(This completes the initial separation of the total variation into between
and within cells; the subsequent partitioning of the variation between cells
is as shown below.)

* See p. 42.

4. Between-methods (A) sum of squares

\[ \frac{453^2}{18} + \frac{753^2}{27} - \frac{1206^2}{45} = 32{,}400.83 - 32{,}320.80 = 80.03 \]

5. Between-levels (B) sum of squares

\[ \frac{325^2}{10} + \frac{557^2}{20} + \frac{324^2}{15} - \frac{1206^2}{45} = 33{,}073.35 - 32{,}320.80 = 752.55 \]

6. Interaction (A × B) sum of squares

\[ 930.45 - 80.03 - 752.55 = 97.87 \]
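The six steps can be retraced with a short script. This is only a sketch: the cell totals are assumed to equal the cell means of table 5.2 multiplied by the group numbers (they agree with the marginal totals 453, 753 and 325, 557, 324), and the total sum of squares is taken from step 1, since it needs the individual scores. The variable names are illustrative.

```python
# Cell totals and cell numbers; rows are levels (superior, average, inferior),
# columns are methods 1 and 2. Totals assumed equal to cell mean x cell number.
cell_totals = [[120, 205], [198, 359], [135, 189]]
cell_counts = [[4, 6], [8, 12], [6, 9]]
total_ss = 1455.20  # step 1, taken from the text (requires the 45 raw scores)


def sum_sq(totals, counts):
    """Sum of (group total squared / group number) over a list of groups."""
    return sum(t * t / n for t, n in zip(totals, counts))


grand_total = sum(map(sum, cell_totals))      # 1206
n_scores = sum(map(sum, cell_counts))         # 45
correction = grand_total ** 2 / n_scores      # 1206^2 / 45

# Step 2: between cells
flat_totals = [t for row in cell_totals for t in row]
flat_counts = [n for row in cell_counts for n in row]
ss_cells = sum_sq(flat_totals, flat_counts) - correction

# Step 3: within cells, by subtraction
ss_within = total_ss - ss_cells

# Step 4: between methods (column totals)
ss_methods = sum_sq([sum(c) for c in zip(*cell_totals)],
                    [sum(c) for c in zip(*cell_counts)]) - correction

# Step 5: between levels (row totals)
ss_levels = sum_sq([sum(r) for r in cell_totals],
                   [sum(r) for r in cell_counts]) - correction

# Step 6: interaction, as the remainder of the between-cells sum
ss_interaction = ss_cells - ss_methods - ss_levels

for name, value in [("cells", ss_cells), ("within", ss_within),
                    ("methods", ss_methods), ("levels", ss_levels),
                    ("interaction", ss_interaction)]:
    print(f"{name:12s} {value:9.2f}")
```

Working from totals rather than means avoids rounding the cell means before squaring.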


Table 5.2 Mean scores from table 5.1

                             Cell means                    Difference in
                             Methods (A)       Level       method means
                             1        2        means       (2 − 1)
Levels (B)   Superior      30.00    34.17      32.50         4.17
             Average       24.75    29.92      27.85         5.17
             Inferior      22.50    21.00      21.60        −1.50

Method means               25.17    27.89      Overall mean = 26.80


The between-cells sum of squares measures the variation that would
remain if all the differences among pupils of the same method and level
were removed, i.e. if each score in table 5.1 were replaced by the mean
score of the cell to which it belongs. It is the same as the sum of the squares
of each cell mean expressed as a deviation from the overall mean (see
table 5.2) and multiplied by the number of scores in the cell, i.e.
$(30.00 - 26.80)^2 \times 4$, plus similar terms for the other cells. Since there are
six cells in all, the sum has 5 degrees of freedom.

The between-methods sum of squares accounts solely for differences
between the two method means. It measures the variation that would
exist if each score were replaced by the mean score of the method to which 
it belongs. Since there are only two methods, this sum has 1 degree of 
freedom. 

In the same way, the between-levels sum takes account of the differences 
between the three level means, measuring the variation that would exist if 
each score were replaced by the mean of the level to which it belongs. 
There are three levels, so the degrees of freedom are now 2. 

The salient feature is that these last two sums do not fully account for
the between-cells sum of squares. A residual exists, the methods × levels
(read ‘methods by levels’) interaction. This is a measure of the extent to
which the method differences for each of the three levels differ among
themselves. Thus, from table 5.2 we see that at both the superior and
average levels method 2 is the more effective (mean differences 4.17 and
5.17 respectively), whereas at the inferior level it is method 1 which is the
more effective (mean difference −1.50). It is the interaction sum of squares
which takes account of the differences among these three mean differences.
If, by some fluke, one method was found to be the more effective at all
three levels and to precisely the same extent, the interaction sum would be
zero. Since the between-cells sum of squares has 5 degrees of freedom, and
1 and 2 of these are accounted for by the between-methods and the
between-levels sums respectively, the interaction sum must have
5 − 1 − 2 = 2 degrees of freedom.
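The remark about a zero interaction can be checked numerically. The sketch below assumes proportionate cell numbers as in the experiment described; the `parallel` cell means are hypothetical, chosen so that method 2 exceeds method 1 by exactly 3 points at every level. The function name is illustrative.

```python
def interaction_ss(cell_means, cell_counts):
    """Interaction sum of squares, obtained by subtraction, for a layout
    of rows (levels) by columns (methods) with proportionate cell numbers."""
    totals = [[m * n for m, n in zip(mean_row, count_row)]
              for mean_row, count_row in zip(cell_means, cell_counts)]
    n_all = sum(map(sum, cell_counts))
    correction = sum(map(sum, totals)) ** 2 / n_all

    def between(group_totals, group_counts):
        return sum(t * t / n for t, n in zip(group_totals, group_counts)) - correction

    ss_cells = between([t for row in totals for t in row],
                       [n for row in cell_counts for n in row])
    ss_rows = between([sum(r) for r in totals], [sum(r) for r in cell_counts])
    ss_cols = between([sum(c) for c in zip(*totals)],
                      [sum(c) for c in zip(*cell_counts)])
    return ss_cells - ss_rows - ss_cols


counts = [[4, 6], [8, 12], [6, 9]]

# Cell means of the experiment (cell total / cell number)
observed = [[120 / 4, 205 / 6], [198 / 8, 359 / 12], [135 / 6, 189 / 9]]

# Hypothetical means: the same method difference (3 points) at all three levels
parallel = [[30.0, 33.0], [25.0, 28.0], [20.0, 23.0]]

ss_observed = interaction_ss(observed, counts)
ss_parallel = interaction_ss(parallel, counts)
print(round(ss_observed, 2), round(abs(ss_parallel), 6))
```

With the observed means the function returns the interaction sum found in the calculation above; with the parallel means it returns zero apart from rounding error.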

The sums of squares and degrees of freedom for the four independent
sources of variation, methods, levels, methods × levels interaction and
within cells are set out in table 5.3. The mean square for methods,
methods × levels interaction and within cells is also shown. The ratio of
the mean square for interaction to that for within cells is obtained as

\[ \frac{48.93}{13.27} = 3.69 \]

which, for 2 and 39 degrees of freedom, is significant at the 5-per-cent level
(statistical table 2A). We may conclude that there are real differences in the
relative effectiveness of the methods at the different levels.

As a significant interaction has been established, there is really little
point in testing the significance of the method differences, though even so
it is often done as a matter of routine. Thus, testing the mean square for
methods against that for within cells (see section 5.3) we would have

\[ \frac{80.03}{13.27} = 6.03 \]

which for 1 and 39 degrees of freedom is significant at the 5-per-cent level.
To conclude that there is a real difference in the effectiveness of the
methods, and that method 2 is the more effective (from the method means
of table 5.2) would, however, be very misleading. Because of the significance
of the interaction, no general statement about the effectiveness of the
methods (i.e. one valid for all levels) can be made. As is apparent from
table 5.2, method 2 is the more effective only for the superior and average
levels.
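The two variance ratios can be reproduced as below. The critical values are an assumption of this sketch: they are approximate 5-per-cent points read from standard F tables, not figures quoted in the text. As a cross-check the within-cells mean square is also recomputed from its sum of squares; 524.75/39 comes to about 13.46 rather than the tabled 13.27, a difference too small to alter either conclusion.

```python
# Mean squares as printed in table 5.3
ms_methods = 80.03
ms_interaction = 48.93
ms_within = 13.27

f_interaction = ms_interaction / ms_within   # on 2 and 39 degrees of freedom
f_methods = ms_methods / ms_within           # on 1 and 39 degrees of freedom

# Assumed 5-per-cent critical values (approximate, from standard F tables)
CRIT_2_39 = 3.24
CRIT_1_39 = 4.09

# Cross-check: within-cells mean square from its sum of squares and df
ms_within_recomputed = 524.75 / 39
f_interaction_recomputed = ms_interaction / ms_within_recomputed
f_methods_recomputed = ms_methods / ms_within_recomputed

print(f"interaction F = {f_interaction:.2f}, significant: {f_interaction > CRIT_2_39}")
print(f"methods     F = {f_methods:.2f}, significant: {f_methods > CRIT_1_39}")
print(f"recomputed within-cells mean square = {ms_within_recomputed:.2f}")
```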


Table 5.3 Analysis of variance of the data in table 5.1

Source of variation      Sum of squares   Degrees of freedom   Mean square

Methods (A)                    80.03               1              80.03
Levels (B)                    752.55               2
Interaction, A × B             97.87               2              48.93
Within cells                  524.75              39              13.27
Total                       1,455.20              44


It may seem reasonable to conclude, too, that the significance of the 
interaction results solely from the differences between the inferior level 
and the other two levels, i.e. there would not be a significant interaction 
if only the superior and average levels were considered. But this conclusion 
would not be justified. A methods × levels interaction confined to only two
levels should be tested in a separate experiment designed for this
purpose.

No test of the significance of the level differences is required, since the
different levels were brought into the experiment for the sole purpose of
reducing the mean square for error. We have no interest in the level
differences as such. If no account had been taken of levels, and if all scores
in each method had been obtained from a random sample of pupils from
the total ability range, the level sum of squares (752.55) together with the
interaction sum (97.87) would have been added to the within-cells sum
(524.75) to give the error sum of squares, i.e. the within-groups sum of
a simple randomized-groups design. This error sum would have had
2 + 2 + 39 = 43 degrees of freedom. The loss of 4 degrees of freedom due
to the separation of levels is more than offset by the very large reduction
in the sum of squares.
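A rough comparison of the two error mean squares shows the size of the gain. The sketch simply pools the sums of squares as described above; the mean squares are computed directly from the sums of squares and degrees of freedom, so they may differ slightly in the last figures from printed values.

```python
ss_levels, ss_interaction, ss_within = 752.55, 97.87, 524.75
df_levels, df_interaction, df_within = 2, 2, 39

# Error term had levels been ignored (simple randomized-groups design)
ss_error_groups = ss_levels + ss_interaction + ss_within
df_error_groups = df_levels + df_interaction + df_within
ms_error_groups = ss_error_groups / df_error_groups

# Error term of the randomized-blocks design (within cells)
ms_error_blocks = ss_within / df_within

print(f"randomized groups: {ss_error_groups:.2f} on {df_error_groups} df, "
      f"mean square {ms_error_groups:.2f}")
print(f"randomized blocks: {ss_within:.2f} on {df_within} df, "
      f"mean square {ms_error_blocks:.2f}")
```

The pooled error mean square is more than twice that obtained after the separation of levels.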

We can now appreciate an essential feature of all experimental designs, 
namely that any reduction in statistical error through the control of an 
extraneous source of variation—intelligence, in this particular illustration 
—must be accompanied by a corresponding reduction in the estimate of 
error. If, in fact, the errors from an extraneous source of variance cannot 
be removed from the estimate of error, then it would be wrong to control 
this source of variation in the first place. Suppose, for instance, that in our 
agricultural field experiment (section 5.1) we discovered an extraneous 
source of variation other than that of soil fertility, that the right-hand side 
of the field (figure 4), say, was less favoured than the left-hand by being in 
the shadow of a building for part of the day. Now in the randomized 
arrangements of plots shown, three of the four plots for treatment D would 
then be on this less favoured side of the field. Why not, then, equalize as 
far as possible the position of the plots of each treatment with respect to 
this? (And, incidentally, if we had five blocks instead of four, the equaliza- 
tion could be made complete, see chapter 8.) This argument, though 
superficially plausible, must be rejected if we are to retain the particular 
design described. It is imperative that randomization within the blocks 
play an unfettered role. 

Similarly, in our educational illustration, where I.Q. replaces natural
soil fertility as the controlled measure, it would not do to match pupils in
the different method groups within each level with respect to industriousness,
or introversion, or the socio-economic status of the home in an
endeavour (which might well be successful) to reduce statistical error still
further from the test results. A random selection within each level, and for
each method independently, is essential. The discussion of this issue
provided by Fisher (1951) cannot be bettered.


5.3 The model


Each score is now regarded as including a part characteristic of the
particular method to which it belongs, a part characteristic of the particular
level, and a part resulting from the interaction of the particular method
and level. We therefore write the model in the same notation as before
(sections 3.3, 4.3 and 4.6) as

\[ x_{ijk} = M + A_i + B_j + (AB)_{ij} + e_{ijk} \]

where $M$ is a component common to all the scores;
$A_i$ is a component common to all the scores of method $i$;
$B_j$ is a component common to all the scores of level $j$;
$(AB)_{ij}$ is a component resulting from the interaction of method $i$ and
level $j$;
and $e_{ijk}$ is a component specific to pupil $k$ of method $i$ and level $j$.

Generally $i$ would run from 1 to $a$, the number of treatments, and $j$ would
run from 1 to $b$, the number of levels (in the particular experiment
described $a = 2$ and $b = 3$). If there were an equal number of scores $n$ in
each cell, then $k$ would run from 1 to $n$. If, however, the cell numbers are
not equal but proportionate, we would have $k$ running from 1 to $n_{ij}$, the
$n_{ij}$ being subject to the conditions of proportionality, i.e. the ratio
$n_{ij}/n_{ij'}$ being constant for all values of $i$, $j$ and $j'$ ($\neq j$) being fixed,
and the ratio $n_{ij}/n_{i'j}$ being constant for all values of $j$, $i$ and $i'$ ($\neq i$)
being fixed. All five contributions to the score $x_{ijk}$ would be independent
of each other, and the $A$s, $B$s, $(AB)$s and $e$s would be regarded as drawn
from normally distributed populations with means of zero and variances of
$\sigma_A^2$, $\sigma_B^2$, $\sigma_{AB}^2$ and $\sigma^2$ respectively. (If $A$ and $B$ are fixed effects, as in
fact they are in the experiment of section 5.2, then $\sigma_A^2$, $\sigma_B^2$ and $\sigma_{AB}^2$
must be regarded only as alternative symbols for

\[ \frac{\sum A^2}{a-1}, \qquad \frac{\sum B^2}{b-1} \qquad \text{and} \qquad \frac{\sum\sum (AB)^2}{(a-1)(b-1)} \]

respectively; see p. 58.)
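The proportionality condition on the cell numbers is easy to test mechanically. A minimal sketch follows; the function name and the second, disproportionate, layout are illustrative only. Exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def is_proportionate(cell_counts):
    """True if each row of cell numbers is a constant multiple of the first row.

    When this holds for the rows it holds for the columns as well, so a
    single check covers both conditions of proportionality.
    """
    first = cell_counts[0]
    for row in cell_counts[1:]:
        ratios = {Fraction(n, m) for n, m in zip(row, first)}
        if len(ratios) != 1:
            return False
    return True


# Rows are methods 1 and 2; columns are the three levels
experiment = [[4, 8, 6], [6, 12, 9]]      # group numbers of the experiment described
after_one_loss = [[4, 8, 6], [6, 12, 8]]  # hypothetical: one pupil lost from a cell

print(is_proportionate(experiment))       # every ratio is 3/2
print(is_proportionate(after_one_loss))   # the ratios differ
```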


The restriction that the interaction components have a zero mean
needs to be restated in the form that the mean (and hence also the sum)
of the $(AB)_{ij}$s should be zero separately for all values of $i$ (summing with
respect to $j$) and also for all values of $j$ (summing with respect to $i$). In
terms of a rectangular arrangement of cells such as that of table 5.1, the
interaction components must sum to zero in every row and in every column.
The precise nature of interaction may, in fact, best be appreciated if a
table is constructed with, say, three methods and three levels, table 5.4.

Table 5.4 A constructed table to illustrate interaction

                                   Methods (A)          Level    Level
                                 1      2      3        totals   means

Levels (B)   1    M              5      5      5
                  A              1      2     −3
                  B              4      4      4
                  AB             6      1     −7
                  Cell value    16     12     −1          27       9

             2    M              5      5      5
                  A              1      2     −3
                  B              0      0      0
                  AB            −4      1      3
                  Cell value     2      8      5          15       5

             3    M              5      5      5
                  A              1      2     −3
                  B             −4     −4     −4
                  AB            −2     −2      4
                  Cell value     0      1      2           3       1

Method totals                   18     21      6          45
Method means                     6      7      2        Overall mean = 5

Let us put the overall mean $M$ equal to 5, and the population means of
the methods $A_1$, $A_2$ and $A_3$ equal to 1, 2 and −3 respectively (so expressing
them as deviations from $M$). The population means of the levels, $B_1$, $B_2$
and $B_3$, we will similarly put equal to 4, 0 and −4. In each cell of the
table $M$ is entered first, then the method component $A$ (which is the same
for all the cells in any one column), then the level component $B$ (which is
the same for all the cells in any one row). Finally components representing
possible interactions are added. These could be any numbers whatsoever,
provided that they sum to zero in every row and every column. We can
therefore fill in only four of these components as we please, the remaining
ones being then determined by the zero sums. This is why the degrees of
freedom for interaction in such a table would be 4, and similarly why they
were only 2 for table 5.1. Generally, with $a$ methods and $b$ levels, the
degrees of freedom will be given by the product of the degrees of freedom
for methods and levels separately, namely $(a-1)(b-1)$.

A point worthy of special note is that the differences in methods are 
unaffected by the level components (the method means in table 5.4 still 
have the postulated deviations of 1, 2 and —3 from the overall mean of 5) 
and, in the same way, the differences in levels are unaffected by the method 
components. In other words, the effects of methods and levels act indepen- 
dently of each other. 
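The construction of table 5.4 can be retraced in a few lines. The sketch takes the components stated in the text; the interaction components are those quoted for the table (6, 1, −7 at level 1 and so on), and the variable names are illustrative.

```python
M = 5
A = [1, 2, -3]            # method components (columns)
B = [4, 0, -4]            # level components (rows)
AB = [[6, 1, -7],         # interaction components, one row per level
      [-4, 1, 3],
      [-2, -2, 4]]

# The interaction components must sum to zero in every row and every column
assert all(sum(row) == 0 for row in AB)
assert all(sum(col) == 0 for col in zip(*AB))

# Each cell value is the sum of its four components
cells = [[M + A[i] + B[j] + AB[j][i] for i in range(3)] for j in range(3)]

method_means = [sum(col) / 3 for col in zip(*cells)]
level_means = [sum(row) / 3 for row in cells]

print(cells)          # the cell values of table 5.4
print(method_means)   # deviations from M are still 1, 2 and -3
print(level_means)    # deviations from M are still 4, 0 and -4
```

Because the interaction components cancel in every row and column, the marginal means recover the postulated method and level components exactly.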

But although they are independent, the methods and levels effects are 
not additive. Thus, to cite but one instance, method 2 overall is superior 
to method 1, and level 1 superior to level 2, yet the most favourable 
combination is not, as might be expected, method 2 acting at level 1. It is 
this lack of additivity—the failure of methods and levels in combination 
to determine the resultant effect at each cross-classification—which is the 
essence of interaction. Because of interaction, an overall difference 
between methods (or levels) is not preserved at each level (or for each 
method). The overall difference may well give quite a misleading indication 
of the separate differences. 

Similarly, the magnitude of the interaction is the extent of the failure
of the sums of squares for methods and levels to equal the total sum of
squares. In table 5.4 the sum of squares for methods is given by
$[1^2 + 2^2 + (-3)^2] \times 3 = 42$, and that for levels by $[4^2 + 0^2 + (-4)^2] \times 3 = 96$.
These contribute to, but do not equal, the total sum of squares, which is
$(16-5)^2 + (12-5)^2 + (-1-5)^2 + \cdots + (2-5)^2 = 274$. The amount
remaining is the measure of the interaction, a sum that could be obtained
directly from first principles as $6^2 + 1^2 + (-7)^2 + \cdots + 4^2 = 136$.
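These sums can be verified from the nine cell values alone. In the sketch below the interaction is found in both ways: as a remainder, and from first principles as the sum of squared residuals (cell value minus row mean minus column mean plus overall mean).

```python
cells = [[16, 12, -1],   # level 1, methods 1 to 3
         [2, 8, 5],      # level 2
         [0, 1, 2]]      # level 3

overall = sum(map(sum, cells)) / 9
level_means = [sum(row) / 3 for row in cells]
method_means = [sum(col) / 3 for col in zip(*cells)]

ss_total = sum((x - overall) ** 2 for row in cells for x in row)
ss_methods = 3 * sum((m - overall) ** 2 for m in method_means)
ss_levels = 3 * sum((m - overall) ** 2 for m in level_means)

# Interaction as the remainder ...
ss_interaction = ss_total - ss_methods - ss_levels

# ... and from first principles, as the sum of the squared residuals
ss_interaction_direct = sum(
    (cells[j][i] - level_means[j] - method_means[i] + overall) ** 2
    for j in range(3) for i in range(3)
)

print(ss_total, ss_methods, ss_levels, ss_interaction, ss_interaction_direct)
```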

The components analysis is shown in table 5.5, the appropriate mean-square
expectations appearing in column 1. An equal number of scores, $n$,
in each cross-classification of methods ($A$) and levels ($B$) is assumed. (The
difference for the case of unequal, but proportionate, group numbers is
that $n$ is replaced by the effective group size calculated in a similar manner
to that described in section 3.4.) The mean-square expectation for interaction
appears as $\sigma^2 + n\sigma_{AB}^2$, this following in precisely the same way as the
mean-square expectation for groups in the basic randomized-groups
design, table 3.5. Again, with $bn$ scores within each method, $bn\sigma_A^2$ appears
in the mean-square expectation for methods, and similarly $an\sigma_B^2$ in the
mean-square expectation for levels. In neither case does the component
$n\sigma_{AB}^2$ appear. This is because both methods and levels are fixed effects,
and so no sampling variation from the cross-classifications as such can be
expected. (If the experiment were repeated with different pupils, the
interaction components $(AB)_{ij}$ would be precisely the same.)


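Table 5.5 Components analysis for a two-factor experiment

                                  Mean-square expectation

Source of      Degrees of     1                            2                                             3
variation      freedom*       A and B fixed                A fixed, B random                             A and B random

A              a−1            $\sigma^2 + bn\sigma_A^2$    $\sigma^2 + n\sigma_{AB}^2 + bn\sigma_A^2$    $\sigma^2 + n\sigma_{AB}^2 + bn\sigma_A^2$
B              b−1            $\sigma^2 + an\sigma_B^2$    $\sigma^2 + an\sigma_B^2$                     $\sigma^2 + n\sigma_{AB}^2 + an\sigma_B^2$
A × B          (a−1)(b−1)     $\sigma^2 + n\sigma_{AB}^2$  $\sigma^2 + n\sigma_{AB}^2$                   $\sigma^2 + n\sigma_{AB}^2$
Within cells   ab(n−1)        $\sigma^2$                   $\sigma^2$                                    $\sigma^2$

* These are written for a levels of A and b levels of B.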


In contrast, suppose that one of the effects, $B$ say, is random. This, of
course, would no longer be a randomized-blocks design, but could, for
instance, be a comparison of methods ($A$) in a number of randomly
selected schools ($B$). The component $\sigma_{AB}^2$ would then appear in the mean-square
expectation for the fixed effect, $A$, because differences in $A$ measured
across the different categories of $B$ would be subject to sampling variation
from the cross-classifications. On the other hand, $\sigma_{AB}^2$ would not appear
in the mean-square expectation for the random effect, $B$, since differences
in $B$ are measured over the $A$ categories which are constant, i.e. they would
be precisely the same if the experiment were repeated. In such an experiment,
therefore, the mean square for methods would have to be tested for
significance against the mean square for interaction as the error term, not
against the mean square for within cells.

For the sake of completeness the mean-square expectations when both
A and B are random effects are included in table 5.5, though such a case
would hardly ever arise in educational research. A full discussion of the
design when all the effects, whether fixed or random, are important in
their own right is postponed until chapter 6, together with the statement
of a rule for writing down the mean-square expectations. Such designs are
termed factorial designs, and the separate effects, such as A and B, factors.
It is for this reason that table 5.5 has been described as giving the
components analysis for a two-factor experiment.
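The pattern of table 5.5 can be summarized as a small function: the interaction component enters the expectation for one effect only when the other effect is random. The sketch below is an illustration of that pattern rather than the general rule promised for chapter 6; the function names and the numerical variances are hypothetical.

```python
def expected_mean_squares(a, b, n, var_e, var_a, var_b, var_ab,
                          a_random=False, b_random=False):
    """Mean-square expectations of table 5.5 for a levels of A, b levels of B
    and n scores per cell."""
    return {
        "A": var_e + (n * var_ab if b_random else 0) + b * n * var_a,
        "B": var_e + (n * var_ab if a_random else 0) + a * n * var_b,
        "AxB": var_e + n * var_ab,
        "within": var_e,
    }


def error_term_for_A(b_random):
    """Source whose mean square has the same expectation as A when var_a = 0."""
    return "AxB" if b_random else "within"


# Hypothetical variances, with no real method differences (var_a = 0)
params = dict(a=2, b=3, n=5, var_e=10.0, var_a=0.0, var_b=4.0, var_ab=2.0)

fixed = expected_mean_squares(**params)                   # column 1
mixed = expected_mean_squares(**params, b_random=True)    # column 2

print(fixed["A"], fixed[error_term_for_A(False)])   # equal under the null hypothesis
print(mixed["A"], mixed[error_term_for_A(True)])    # equal under the null hypothesis
```

In each model the appropriate error term is the source whose expectation matches that of methods when the method component is zero.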


5.4 Unequal and disproportionate cell numbers 


Sometimes, through no fault of the investigator, equal or proportionate 
numbers of scores in various subgroups or cells are not obtained. In a
methods experiment, for instance, one or more pupils may become ill 
during the course of the experiment, or may be unable to take the final 
test for some other reason. The result would then be a disproportion in the 
subgroup numbers. The partitioning of the total sum of squares cannot 
proceed until this disproportion has been corrected. 

One way would be to allow for this possibility from the beginning, and 
to include in each subgroup slightly more pupils than would finally be 
needed. Exact equality or proportionality could be attained afterwards by
randomly rejecting scores from the unduly large subgroups. If only a few
scores have to be rejected, this might be the most sensible method to adopt.
But if a large number of scores have to be rejected, it can be criticized as
being wasteful, i.e. involving an appreciable sacrifice of data.

With only one or two ‘missing’ scores the best procedure, therefore,
might be to replace each missing score by the mean of all the obtained
scores in the particular cell or subgroup to which it belongs. It would then
be necessary to subtract 1 degree of freedom for each of the scores replaced
in this way from the degrees of freedom for within cells, and also for total.
(This would leave unaffected the degrees of freedom for treatments, levels
and interaction.) The calculation of the various sums of squares would
then proceed in the usual way. The tests of significance would only be
approximate, but if only one or two scores have to be replaced, the tests
would be sufficiently accurate for most purposes.
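The replacement procedure might be sketched as follows. The four scores are hypothetical, with `None` marking the missing one; the degrees of freedom are those of the experiment of section 5.2, used here purely for illustration.

```python
# One cell of a methods experiment; None marks a pupil who missed the final test
cell = [24, 27, None, 30]          # hypothetical scores

obtained = [x for x in cell if x is not None]
cell_mean = sum(obtained) / len(obtained)

# Replace each missing score by the mean of the obtained scores in its cell
completed = [cell_mean if x is None else x for x in cell]
n_replaced = cell.count(None)

# Deduct 1 degree of freedom per replaced score from within cells and from total;
# the degrees of freedom for treatments, levels and interaction are unchanged
df_within = 39 - n_replaced
df_total = 44 - n_replaced

print(completed, df_within, df_total)
```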

Another, slightly more refined, method of dealing with disproportionate
subgroup numbers is described by Fei Tsao (1946); it involves adjusting
the sums of squares of the scores from all cells, not merely those from cells
in which the missing scores occur. The method is applicable when the 
subgroup numbers differ only slightly from proportionality. For methods 
of dealing with subgroup numbers differing markedly from proportionality 
the reader is referred to Snedecor (1956). 


5.5 The case of one score per cell 


Occasionally a design with just one score per cell is useful. This in terms of 
figure 4 means that each plot within a block corresponds to an individual, 
a complete block corresponding to a group of individuals (one for each 
treatment) matched with respect to the measure forming the blocks. These 
individuals may, for instance, have equal I.Q.s, or equal scores on a 
suitable pre-test, and they would be allocated at random to the different 
treatments in the same way as before. The partitioning of the total sum 
of squares would then be exactly the same as that of the between-cells sum 
of squares described in section 5.2 (see figure 5). There is now no variation 
within cells, since two or more scores per cell would be needed for this, 
and so no estimate of the population variance $\sigma^2$ is possible. The method
of calculating the sums of squares for treatments (or methods) and blocks
(or levels) would remain as described, and the interaction sum of squares
would be obtained once again as a residual by subtraction.

As the blocks would now be groups of matched individuals, it is usually
reasonable to regard this effect as random rather than fixed. (Thus, to
begin by deciding on certain fixed I.Q.s, and then selecting pupils for these
specified I.Q.s would be rather far-fetched. More usual would be the
practice of taking a random group and then selecting others for individual
matching.) The components analysis would then follow from table 5.5
column 2. With $n = 1$, the mean-square expectations for the three sources
of variation would be as follows:

Treatments, A:     $\sigma^2 + \sigma_{AB}^2 + b\sigma_A^2$
Blocks, B:         $\sigma^2 + a\sigma_B^2$
Interaction, AB:   $\sigma^2 + \sigma_{AB}^2$


The mean square for treatments would then be tested for significance 
against that for interaction. No test for blocks is possible; but this would 
be only of academic interest anyway, as real differences among a random 
group of individuals may be assumed. Again, no test for interaction is 
possible, though again real individual × treatment interactions may usually
be assumed. The fact that $\sigma^2$ by itself is not estimable may therefore be of
little consequence.

If, on the other hand, the blocks effect together with that of treatments
is fixed, no test of significance is possible at all, as may be seen from
table 5.5 column 1. This means that the component $\sigma_{AB}^2$ must be omitted
from the mean-square expectation for treatments as written above. Only if
we make an a priori assumption of zero interaction (a very questionable
step) can the interaction mean square be used as the error term for testing
the treatment differences. Of course, if the ratio of the mean square for
treatments to that for interaction did exceed the appropriate F ratio in
statistical table 2, significance for the treatment differences could safely
be claimed. The procedure, however, would too often lead to a false
acceptance of the null hypothesis, a failure to detect real treatment
differences. The use of a randomized-blocks design with one score per
cell for a fixed-effects model should therefore be avoided.

A further illustration of the case of one score per cell is when the same
person undergoes all the treatments. This, of course, is only practicable if
the treatments do not influence each other (which is hardly ever the case in
educational investigations),* or if the influence of previous treatments is
itself being investigated. The second situation would arise, for instance, if
the treatments consisted of giving the same test (or equivalent forms of
the same test, the forms having been equated for difficulty) to a group of
pupils, the purpose of the experiment being to study the effect of testing.
Each pupil then takes all the treatments and thus corresponds not to a plot
but to an entire block of an agricultural field experiment. If, however,
equivalent forms of the test provide the different treatments, and the
forms have not been equated for difficulty, it would be necessary to vary
systematically the order in which the forms are being administered, so
that all do not take form 2 (say) after form 1, and form 3 after form 2.
This would introduce another dimension into the design, making it no
longer a design with randomized blocks, in fact. A discussion of this is
resumed later (chapter 8).

Finally, it may have occurred to the reader that data from experiments
such as that described in section 5.2 could also be analysed on the basis of
one score per cell by considering only the mean score of each
cross-classification. Thus, if a methods experiment were conducted in several
randomly selected schools, the mean score for each method group in each
school could provide the data for analysis. The mean square for methods
would then be tested for significance against the mean square for interaction
as before. Practical advantages would be that the method group sizes do
not have to be equal, or proportionate (the analysis would assume that
groups of the same size would have produced the same set of mean scores),
and the method groups need not be randomly selected within the schools.
Again homogeneity of variance need not be assumed within the method
groups, an assumption which Lindquist (1940) argues is often not fully
justified in method experiments anyhow. If, then, our interest is centred on
the overall method differences, an analysis based only on mean scores
might have much to recommend it. The main drawback is that no test for
the significance of interaction would be possible, so that the proper
interpretation of the method differences (even if they are statistically
significant) may not be possible. On balance the full analysis is to be
preferred.

* For example, the same pupil could hardly be taught the same topic by
more than one method.


5.6 A test for non-additivity 


Basic to all analyses of variance is the assumption of additivity, i.e. that all 
the components into which a test score is resolved add together to give the 
obtained result. Components do not necessarily have to be combined in 
this way. They could, for instance, be combined multiplicatively, the 
obtained score being the product of the separate components. Tukey (1949) 
has developed a test of significance for non-additivity, a test which, if 
satisfied at some predetermined level (such as the 5-per-cent level), would 
lead to the conclusion that additivity is not a reasonable assumption. The 
test is illustrated from data shown in table 5.6. Scores for two treatments 
and eight blocks are shown, together with mean scores for all treatments 
and blocks and the deviations of these mean scores from the overall mean. 

A sum of products is calculated by multiplying each test score by the
deviation recorded in its row (or column) and then summing for all scores
in the column (or row). Thus, if we decide to multiply by the column
deviations −0.75 and 0.75, the scores in each row would sum to the value
shown in column 1 of table 5.7. Thus, for example, the first value in this
column is obtained as

\[ 25 \times (-0.75) + 28 \times (0.75) = 2.25 \]

These are then multiplied by the other set of deviations (the row deviations
in this case), set out again for convenience in column 2 of table 5.7. The
products are seen to sum to 15.00. The reader may verify that if we begin
by multiplying each test score by the row deviation and sum for each
column, and then multiply the values so obtained by the column deviations,
the same sum of products will result.


Table 5.6 Test scores for a design with two treatments and eight blocks

             Treatment                                  Deviation from
Blocks       1       2      Block sum    Block mean     overall mean

1           25      28         53          26.5             7.00
2           24      25         49          24.5             5.00
3           22      26         48          24.0             4.50
4           19      17         36          18.0            −1.50
5           17      18         35          17.5            −2.00
6           16      16         32          16.0            −3.50
7           15      17         32          16.0            −3.50
8           12      15         27          13.5            −6.00

Treatment sum      150     162       Overall sum = 312    Overall mean = 19.5
Treatment mean     18.75   20.25
Deviation from
overall mean       −0.75    0.75


Sums of squares are calculated for the row and column deviations. In
this example, therefore, we have

\[ 7.00^2 + 5.00^2 + \cdots + (-6.00)^2 = 161.00 \]

for the sum of squares of the row deviations, and

\[ (-0.75)^2 + (0.75)^2 = 1.125 \]

for the sum of squares of the column deviations. The sum of squares for
non-additivity is then obtained as the square of the sum of products
divided by both sums of squares; i.e. in this example it is

\[ \frac{(15.00)^2}{161.00 \times 1.125} = 1.24 \]

Table 5.7 Calculations for a test of non-additivity (from the data of
table 5.6)

Weighted sum        Row
of row scores       deviations      Products
1                   2               1 × 2

    2.25               7.00           15.75
    0.75               5.00            3.75
    3.00               4.50           13.50
   −1.50              −1.50            2.25
    0.75              −2.00           −1.50
    0.00              −3.50            0.00
    1.50              −3.50           −5.25
    2.25              −6.00          −13.50

                    Sum of products = 15.00

Table 5.8 Testing the significance of non-additivity (from the data of
table 5.6)

Source of variation                 Sum of squares   Degrees of freedom   Mean square

Non-additivity                           1.24                1               1.24
Remainder                               11.76                6               1.96
Treatment × blocks interaction          13.00                7

This sum is part of the interaction (treatment × blocks) sum of squares,
and has 1 degree of freedom. The reader may verify that the interaction
sum of squares for the data of table 5.6 works out as 13.00.* Subtracting
the sum of squares for non-additivity from this gives a remainder sum of
11.76, with 6 degrees of freedom (see table 5.8). It is this which provides
the error term for testing the significance of non-additivity.

From table 5.8 $F = 1.24/1.96 < 1$, which shows that the effect of
non-additivity is far too small for statistical significance. In other words, there
is no evidence against additivity for the data of table 5.6.

* The total sum of squares is 344.00; the treatment sum of squares is 9.00
and the block sum 322.00, so giving the interaction sum as
344.00 − 9.00 − 322.00 = 13.00.
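The whole test can be retraced in a short script. The sketch assumes the sixteen scores of table 5.6 as listed below (treatment 1 and treatment 2 for each of the eight blocks); the variable names are illustrative, and no significance table is consulted since the ratio falls below 1.

```python
# Scores of table 5.6: (treatment 1, treatment 2) for blocks 1 to 8
scores = [(25, 28), (24, 25), (22, 26), (19, 17),
          (17, 18), (16, 16), (15, 17), (12, 15)]
n_blocks, n_treatments = len(scores), 2

overall_mean = sum(map(sum, scores)) / (n_blocks * n_treatments)
row_dev = [sum(row) / n_treatments - overall_mean for row in scores]
col_dev = [sum(col) / n_blocks - overall_mean for col in zip(*scores)]

# Column 1 of table 5.7: each row's scores weighted by the column deviations
weighted = [sum(x * d for x, d in zip(row, col_dev)) for row in scores]
# Sum of products of columns 1 and 2 of table 5.7
sum_products = sum(w * d for w, d in zip(weighted, row_dev))

ss_row_dev = sum(d * d for d in row_dev)
ss_col_dev = sum(d * d for d in col_dev)
ss_nonadditivity = sum_products ** 2 / (ss_row_dev * ss_col_dev)

# Interaction sum of squares as total minus treatments minus blocks
ss_total = sum((x - overall_mean) ** 2 for row in scores for x in row)
ss_treatments = n_blocks * ss_col_dev
ss_blocks = n_treatments * ss_row_dev
ss_interaction = ss_total - ss_treatments - ss_blocks

ss_remainder = ss_interaction - ss_nonadditivity
df_remainder = (n_blocks - 1) * (n_treatments - 1) - 1
f_ratio = ss_nonadditivity / (ss_remainder / df_remainder)

print(f"sum of products   {sum_products:.2f}")
print(f"non-additivity SS {ss_nonadditivity:.2f} on 1 df")
print(f"remainder SS      {ss_remainder:.2f} on {df_remainder} df")
print(f"F ratio           {f_ratio:.2f}")
```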


[Vertical axis: weighted sum of block scores. Horizontal axis: deviations
from overall mean, marked from −6 to 6.]

Figure 6 Plot of the weighted sums of block scores against the
block deviations from the overall mean (columns 1 and 2 of
table 5.7). The unbroken line is the mean of the weighted sums, and
the broken lines are confidence limits on each side of the mean.


If the F ratio had shown a significant non-additivity, a possible solution
would have been to transform the scale of measurement, though an
examination of the data for possible ‘discrepant’ scores is recommended
first. Such an examination would consist of plotting the weighted sum of
the row scores (column 1 of table 5.7) against the row deviations (column 2
of table 5.7). These are shown in figure 6. The mean of the weighted sum of
row scores is also shown, together with limits set above and below this
mean at a distance equal to

$$2\sqrt{(\text{sum of squares of the treatment deviations}) \times (\text{remainder mean square})}$$

(In this illustration this distance would be
$2\sqrt{1.125 \times 1.96} = 2.97$.) A ‘discrepant’ score would be revealed
by a point in the figure unusually high or low, i.e. outside the limits
indicated by the broken lines. A transformation of the scale would be
indicated if the points tended to lie on a straight line. The reader is
referred to Tukey (1949) for further information.
Chapter 6 Crossing Designs: Factorial Arrangements


6.1 Introduction 


In chapter 5 a two-way experiment was described in which the effect of 
blocks was separated from that of treatments in order to reduce statistical 
error. Different blocks were introduced, in fact, solely for this purpose. 
Sometimes, however, our interest would be equally divided between the 
two effects. We might have two distinct sets of treatments, for example— 
treatments A being, say, particular methods in which a topic in a school 
subject is taught, and treatments B the ways in which the groups taught 
are subsequently examined (e.g. by essay-writing, objective tests, oral 
questioning). Each set of treatments would then be called a factor, and the 
experimental design a factorial design. 

A factorial design need not be limited to two factors. Any number of 
factors can be incorporated. The essential feature is that each factor is 
investigated in all the combinations of the other factors. A design with
three factors is described in section 6.2. The randomized-blocks design of 
chapter 5 would correspond to a factorial design with two factors. It would 
be unusual to describe the design in this way because we would have little 
interest in the effect of the blocks as such. 

Possible factors in educational investigations would be methods of
teaching, methods of testing, occasions and conditions of testing, grades of
intelligence of the testees, and the motivational and personality
characteristics of the testees. The classifications based on each of the
factors are termed levels. With below-average, average and above-average
classifications for I.Q., for instance, the factor of intelligence would be
said to have three levels. Levels, however, need not be ‘ordered’, i.e.
capable of being quantified: a possible factor in an investigation of
reading ability, say, could be the style of type of the reading material
used (roman, gothic, clarendon, etc.), each particular style of type then
being referred to as a level of the factor of style. All the factors of a
factorial design have two or more levels, and the complete arrangement is
described by their product. Thus, a 2 × 3 × 4 design would refer to an
experiment with three factors, one being investigated at two levels, one at
three, and one at four.
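
The idea that every level of each factor is combined with every level of the others can be sketched in a few lines of Python. The factor names and level labels below are invented purely for illustration; only the 2 × 3 × 4 shape is taken from the text.

```python
from itertools import product


def design_cells(levels):
    """List every cell (one level per factor) of a full factorial design.

    `levels` maps a factor name to the list of its level labels.
    """
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


# Hypothetical factors for a 2 x 3 x 4 design.
levels = {
    "method": ["M1", "M2"],
    "testing": ["essay", "objective", "oral"],
    "occasion": ["O1", "O2", "O3", "O4"],
}

cells = design_cells(levels)
print(len(cells))   # number of cells is the product of the numbers of levels
print(cells[0])
```

Each dictionary in `cells` describes one group of the experiment, so the length of the list is the number of independently selected groups the design would call for.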

It is instructive to contrast the factorial experiment with the one-time 
‘ideal’ of scientific experiment, that of varying the different factors or 
conditions one at a time (see Fisher 1951). The factorial experiment not 
only provides all the information possible from the separate one-factor- 
varied experiments (and, moreover, does so more efficiently); it also pro- 
vides information about the interaction of the factors. This is an important 
advantage, in that knowing only the effect of a factor with all the others 
held constant provides no indication at all of whether this effect remains 
the same at all levels of the other factors. When the number of factors is 
three or more, there is more than the single interaction of the type already 
described (section 5.2). The partitioning of the sum of squares becomes 
complex. A three-factor experiment will therefore be described in some 
detail, and the extension to experiments with more than three factors may 
then be readily understood. 


6.2 A three-factor experiment 


An investigator believes that children will perform differently on a certain 
task according to various personality traits (in particular extraversion and 
anxiety) and also according to the conditions under which the task is set 
(and more especially if these conditions involve externally imposed stress). 
Children are therefore divided into two categories, say, with respect to both 
anxiety (anxious and non-anxious children) and extraversion (extraverted 
and introverted children) either by the ratings of experienced observers or 
by the children's scores on suitable questionnaires (such as the Maudsley 
Personality Inventory for the assessment of introversion-extraversion), 
and random groups for each of the four cross-classifications are selected. 
We will suppose that the performance of the task does not depend to any 
appreciable extent on intelligence, or, alternatively, that the children are 
sampled from a narrow range of I.Q.s.* 

Three levels of conditions are proposed, one being a neutral condition
in which the children attempt the task without any induced stress: they
could be told, for instance, that the task was given them merely to find out
whether it was suitable for their age group, and that if they did not do
well, it would mean merely that the task was not suitable. A second
condition would be one in which slight stress would be introduced: thus, the
children could be told that they were expected to do well. In a third
condition the stress could be made more severe, possibly by telling the
children that if they did not do well, it would mean they were not suited to
their present class and might be moved elsewhere. Separate, independently
selected groups (from each of the anxiety-extraversion cross-classifications)
would be tested under these conditions. The experiment would then be a
2 × 2 × 3 factorial experiment, involving twelve independently selected
groups in all.

* If the effect of intelligence was considerable, and if the children's I.Q.s
ranged widely, intelligence would have to be brought in as an additional
factor (in two or more levels), making the experiment one of four factors.

Table 6.1 Scores of twelve groups, in a 2 × 2 × 3 factorial design

                 C1      C2      C3     Total

A1B1            111     102      94      307
A1B2            125     115     107      347
A2B1            119     114     109      342
A2B2            109      99      90      298

Total           464     430     400     1294

Totals   A1  654     B1  649
         A2  640     B2  645

Overall total = 1294

N.B. Only the total score of each cell (the sum of five scores) is shown.

Key: A1 Introverted     A2 Extraverted
     B1 Non-anxious     B2 Anxious
     C1 Neutral         C2 Slight stress     C3 Severe stress


Total variation
    Between cells
        A, B, C
        A × B, A × C, B × C     (first-order interactions)
        A × B × C               (second-order interaction)
    Within cells

Figure 7 Breakdown of variation in a three-factor experiment


Suppose that five children are selected for each group (in an actual
experiment of this nature it would be advisable for the group numbers to
be larger, though as before the group numbers would have to be equal or
proportionate) and that the test scores are as shown in table 6.1. Each of
the sixty scores can be classified in three ways, as belonging to particular
levels of extraversion (A), anxiety (B) and conditions of testing (C). Our
interest is in the differences between the levels of each of these factors,
as well as in the differences arising from the interactions of these
factors. There will now be three interactions of the type described in
chapter 5 (first-order interactions, as they are termed) since extraversion
interacts with anxiety (A × B), and with conditions (A × C), and anxiety
also interacts with conditions (B × C). The sums of squares for all three
interactions are a part of the variation between cells (each group of five
scores being termed a cell, as before), as also are the sums of squares for
differences between the levels of each factor in turn. All these sums,
however, do not exhaust the variation between cells. A residual sum is left.
This sum measures a second-order interaction, that of extraversion, anxiety
and conditions (A × B × C) together (see figure 7).


Table 6.2 Total scores for the cross-classifications of A, B and C taken
two at a time (from table 6.1)

(i) A × B
              A1      A2
      B2     347     298
      B1     307     342
Each total is the sum of 15 scores.

(ii) A × C
              A1      A2
      C1     236     228
      C2     217     213
      C3     201     199
Each total is the sum of 10 scores.

(iii) B × C
              B1      B2
      C1     230     234
      C2     216     214
      C3     203     197
Each total is the sum of 10 scores.

Key: A1 Introverted     A2 Extraverted
     B1 Non-anxious     B2 Anxious
     C1 Neutral         C2 Slight stress     C3 Severe stress


The calculation is as shown below. To facilitate the calculation of the
sums of squares for the first-order interactions, table 6.2 has been
compiled; this shows the total scores in the cross-classifications of the
factors A, B and C taken two at a time. Thus, in section i, showing the
A × B cross-classification, each cell shows the total of the scores for all
levels of C.
1. Total sum of squares (the first term being the sum of the squares of all
   60 scores)

$$\sum x^2 - \frac{1294 \times 1294}{60} = 28{,}684 - 27{,}907.26 = 776.74$$

2. Between-cells sum of squares (sum of 12 terms)

$$\frac{111^2}{5} + \frac{102^2}{5} + \dots + \frac{90^2}{5} - \frac{1294^2}{60} = 28{,}136 - 27{,}907.26 = 228.74$$

3. Within-cells sum of squares

$$776.74 - 228.74 = 548.00$$

4. Between A levels sum of squares

$$\frac{654^2}{30} + \frac{640^2}{30} - \frac{1294^2}{60} = 27{,}910.53 - 27{,}907.26 = 3.27$$

5. Between B levels sum of squares

$$\frac{649^2}{30} + \frac{645^2}{30} - \frac{1294^2}{60} = 27{,}907.53 - 27{,}907.26 = 0.27$$

6. Between C levels sum of squares

$$\frac{464^2}{20} + \frac{430^2}{20} + \frac{400^2}{20} - \frac{1294^2}{60} = 28{,}009.80 - 27{,}907.26 = 102.54$$

7. A × B interaction sum of squares (see table 6.2 i; sum of 4 terms)

$$\frac{347^2}{15} + \frac{298^2}{15} + \frac{307^2}{15} + \frac{342^2}{15} - 3.27 - 0.27 - \frac{1294^2}{60} = 28{,}028.40 - 3.27 - 0.27 - 27{,}907.26 = 117.60$$

8. A × C interaction sum of squares (see table 6.2 ii; sum of 6 terms)

$$\frac{236^2}{10} + \frac{228^2}{10} + \dots + \frac{199^2}{10} - 3.27 - 102.54 - \frac{1294^2}{60} = 28{,}014.00 - 3.27 - 102.54 - 27{,}907.26 = 0.93$$

9. B × C interaction sum of squares (see table 6.2 iii; sum of 6 terms)

$$\frac{230^2}{10} + \frac{234^2}{10} + \dots + \frac{197^2}{10} - 0.27 - 102.54 - \frac{1294^2}{60} = 28{,}012.60 - 0.27 - 102.54 - 27{,}907.26 = 2.53$$

10. A × B × C interaction sum of squares

$$228.74 - 3.27 - 0.27 - 102.54 - 117.60 - 0.93 - 2.53 = 1.60$$
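
The ten steps can be checked mechanically. The Python sketch below is an illustration rather than the author's procedure: it assumes the twelve cell totals of the 2 × 2 × 3 experiment and the sum of the 60 squared scores (28,684) quoted in step 1, and the function and key names are invented for the purpose. Because it carries the correction term to full precision instead of 27,907.26, some results differ from the printed ones by 0.01 in the second decimal place.

```python
from itertools import product

# Cell totals indexed as totals[A][B][C] (levels A1-A2, B1-B2, C1-C3).
totals = [
    [[111, 102, 94], [125, 115, 107]],   # A1: B1, B2
    [[119, 114, 109], [109, 99, 90]],    # A2: B1, B2
]
n = 5            # scores in each cell
raw_ss = 28684   # sum of the squares of all 60 scores (step 1)


def sums_of_squares(totals, n, raw_ss):
    a, b, c = len(totals), len(totals[0]), len(totals[0][0])
    n_scores = a * b * c * n
    index = list(product(range(a), range(b), range(c)))
    grand = sum(totals[i][j][k] for i, j, k in index)
    correction = grand ** 2 / n_scores

    def margin(keep):
        """Corrected sum of squares of the totals over the axes in `keep`."""
        sums = {}
        for idx in index:
            key = tuple(idx[axis] for axis in keep)
            sums[key] = sums.get(key, 0) + totals[idx[0]][idx[1]][idx[2]]
        scores_per_total = n_scores // len(sums)
        return sum(t * t for t in sums.values()) / scores_per_total - correction

    ss = {"A": margin((0,)), "B": margin((1,)), "C": margin((2,))}
    ss["AxB"] = margin((0, 1)) - ss["A"] - ss["B"]
    ss["AxC"] = margin((0, 2)) - ss["A"] - ss["C"]
    ss["BxC"] = margin((1, 2)) - ss["B"] - ss["C"]
    between = margin((0, 1, 2))
    # The second-order interaction is what is left of the between-cells sum.
    ss["AxBxC"] = between - sum(ss.values())
    ss["Within"] = raw_ss - correction - between
    ss["Total"] = raw_ss - correction
    return ss


ss = sums_of_squares(totals, n, raw_ss)
for source, value in ss.items():
    print(f"{source:>7}: {value:8.2f}")
```

The helper `margin` does for any set of factors what steps 4 to 9 do by hand: it collapses the cell totals over the remaining factors, squares the resulting totals, divides by the number of scores in each, and subtracts the correction term.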


The calculation of the sums of squares for extraversion (A), anxiety (B)
and conditions (C), the main effects as they are termed, follows the same
pattern as before. The sum of squares for each of the first-order
interactions is calculated by summing the squares of the cell totals of the
appropriate table (table 6.2 i, ii and iii), each divided by the number of
scores in the total, and subtracting the usual correction term and also the
sum of squares for each of the two main effects involved. (Thus, for the
A × B interaction the sum of squares for both A and B must be subtracted.)
In the same way, when calculating the sum of squares for the second-order
interaction (the appropriate cell totals now being those of table 6.1) the
sums of squares of all the main effects, A, B and C, and of all the
first-order interaction effects, A × B, A × C and B × C, must be
subtracted. The degrees of freedom for the first-order interactions follow
the multiplicative rule mentioned earlier (section 5.3); i.e. they are the
product of the degrees of freedom for the two relevant main effects. In the
same way, the degrees of freedom for the second-order interaction are given
by the product of the degrees of freedom for all three main effects.
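
The multiplicative rule for degrees of freedom lends itself to a small sketch. The function below is illustrative only (its name and the dictionary layout are assumptions); it applies the rule to any number of factors with equal numbers of scores in each cell.

```python
from itertools import combinations
from math import prod


def degrees_of_freedom(levels, n):
    """Degrees of freedom for every effect of a full factorial design.

    `levels` maps a factor name to its number of levels; `n` is the number
    of scores in each cell.
    """
    names = list(levels)
    df = {}
    for order in range(1, len(names) + 1):
        for effect in combinations(names, order):
            # Product of (levels - 1) over the factors involved.
            df["x".join(effect)] = prod(levels[f] - 1 for f in effect)
    cells = prod(levels.values())
    df["Within cells"] = cells * (n - 1)
    df["Total"] = cells * n - 1
    return df


df = degrees_of_freedom({"A": 2, "B": 2, "C": 3}, n=5)
for source, value in df.items():
    print(f"{source:>12}: {value}")
```

For the 2 × 2 × 3 experiment with five scores per cell, the separate degrees of freedom add up to the total of 59, which is a convenient check on the partition.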

The sums of squares and degrees of freedom of all effects are set out in
table 6.3. The mean squares are also given. Taking the within-cells mean
square as the estimate of error,* we see that all effects except two (C and
A × B) are clearly insignificant, the F ratios being less than 1. The ratio
of the mean square for the A × B interaction to that for within cells is
obtained as

$$F = \frac{117.60}{11.42} = 10.30$$

This for 1 and 48 degrees of freedom is significant at the 1-per-cent level
(statistical table 2B).


Table 6.3 Analysis of variance of the data in table 6.1

Source of         Sum of      Degrees
variation         squares     of freedom     Mean square

A                   3.27          1              3.27
B                   0.27          1              0.27
C                 102.54          2             51.27†
A × B             117.60          1            117.60‡
A × C               0.93          2              0.46
B × C               2.53          2              1.26
A × B × C           1.60          2              0.80
Within cells      548.00         48             11.42

Total             776.74         59

† Significant at the 5-per-cent level.
‡ Significant at the 1-per-cent level.
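
As a rough check on the analysis of variance, the mean squares and F ratios can be recomputed from the sums of squares and degrees of freedom. This sketch assumes the fixed-effects case, in which every effect is tested against the within-cells mean square; the names used are illustrative, and significance would still be read from the F tables.

```python
# (sum of squares, degrees of freedom) for each source of variation.
anova = {
    "A": (3.27, 1),
    "B": (0.27, 1),
    "C": (102.54, 2),
    "AxB": (117.60, 1),
    "AxC": (0.93, 2),
    "BxC": (2.53, 2),
    "AxBxC": (1.60, 2),
    "Within cells": (548.00, 48),
}


def f_ratios(table, error="Within cells"):
    """Ratio of each mean square to the error mean square."""
    error_ss, error_df = table[error]
    error_ms = error_ss / error_df
    return {
        source: (ss / df) / error_ms
        for source, (ss, df) in table.items()
        if source != error
    }


ratios = f_ratios(anova)
for source, f in ratios.items():
    print(f"{source:>6}: F = {f:.2f}")
```

Only C and A × B give ratios above 1, so these are the only effects that need to be referred to the F tables.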


To interpret a statistically significant interaction, a table of mean
scores for all the cross-classifications is usually prepared. However, with
equal numbers in each cross-classification the table of total scores will
serve instead. We see from table 6.2 i that the interpretation of the A × B
interaction is strikingly clear. Whereas for anxious children introverts
score more highly than extraverts, for non-anxious children it is the
extraverts who do better. There is little point in comparing the total (or
mean) scores of the introverted and extraverted children, or, again, of the
anxious and non-anxious children, as a whole. The introverted children
(total score 654) do better than the extraverted (total score 640), though
from table 6.3 we see that the difference is not significant. Even if it were
significant, however (and a main effect could be significant at the same
level as a related interaction), the conclusion that introverts do better
than extraverts would not have a general validity. It would be limited to
children who are anxious.

* For a fixed-effects model the within-cells mean square provides the
estimate of error for all the other effects (see section 6.3).

In precisely the same way, a significant second-order interaction would
limit the scope of any first-order interaction involving the same effects. If
the A × B × C interaction happened to be significant in this investigation,
it would mean that the conclusion drawn from the significant A × B
interaction, namely that both the anxious introverts and non-anxious
extraverts do better than the other two groups, would not apply to all the
levels of C (conditions of testing). It might be found to apply, for
instance, only to the neutral and slight stress conditions. As the
A × B × C interaction is insignificant, however, we may conclude that the
superiority of the anxious introverts and non-anxious extraverts applies at
all three conditions of testing. An inspection of the totals for A × B
cross-classifications picked out from table 6.1 for the C1, C2 and C3 levels
in turn readily confirms this.
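
The inspection suggested here can be sketched in Python. The example assumes the twelve cell totals of the experiment (A1 introverted, A2 extraverted; B1 non-anxious, B2 anxious) and computes, at each level of C, how much the anxious-minus-non-anxious difference changes between introverts and extraverts. The function name is illustrative.

```python
# Cell totals keyed by (A level, B level, C level).
cell_totals = {
    ("A1", "B1", "C1"): 111, ("A1", "B1", "C2"): 102, ("A1", "B1", "C3"): 94,
    ("A1", "B2", "C1"): 125, ("A1", "B2", "C2"): 115, ("A1", "B2", "C3"): 107,
    ("A2", "B1", "C1"): 119, ("A2", "B1", "C2"): 114, ("A2", "B1", "C3"): 109,
    ("A2", "B2", "C1"): 109, ("A2", "B2", "C2"): 99,  ("A2", "B2", "C3"): 90,
}


def ab_contrast_by_c(totals):
    """(B2 - B1 at A1) minus (B2 - B1 at A2), separately for each C level."""
    contrast = {}
    for c in ("C1", "C2", "C3"):
        at_a1 = totals[("A1", "B2", c)] - totals[("A1", "B1", c)]
        at_a2 = totals[("A2", "B2", c)] - totals[("A2", "B1", c)]
        contrast[c] = at_a1 - at_a2
    return contrast


contrast = ab_contrast_by_c(cell_totals)
print(contrast)
```

The contrast has the same sign and much the same size at every level of C, which is what an insignificant A × B × C interaction leads one to expect.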


The ratio of the mean square for C to that for within cells is

$$F = \frac{51.27}{11.42} = 4.49$$

This, for 2 and 48 degrees of freedom, is significant at the 5-per-cent level
(statistical table 2A). From the totals recorded at the bottom of table 6.1,
we see that the slight stress condition provides lower scores than the
neutral condition, and the severe stress condition lower scores still.
Increasing stress has a deleterious influence. Also, since there is no
significant interaction involving C, we may conclude that the effect of
stress applies equally to anxious and non-anxious, introverted and
extraverted children alike.

The significance of the separate differences between the scores at the
three levels of C could also be determined in the manner described in
section 3.2. The reader is invited to verify that of the three separate
differences, only that between the scores at C1 and C3, the neutral and
severe stress conditions, is significant at the 5-per-cent level.
6.3 The model


With the same notation as that for the model of the two-factor experiment
developed in the last chapter (section 5.3), the model may now be set down
as

$$x_{ijkl} = M + A_i + B_j + C_k + (AB)_{ij} + (AC)_{ik} + (BC)_{jk} + (ABC)_{ijk} + e_{ijkl}$$

where $M$ is a component common to all the scores;
$A_i$ is a component common to all scores in level i of factor A;
$B_j$ is a component common to all scores in level j of factor B;
$C_k$ is a component common to all scores in level k of factor C;
$(AB)_{ij}$ is a component resulting from the interaction of level i of
factor A and level j of factor B;
$(AC)_{ik}$ is a component resulting from the interaction of level i of
factor A and level k of factor C;
$(BC)_{jk}$ is a component resulting from the interaction of level j of
factor B and level k of factor C;
$(ABC)_{ijk}$ is a component resulting from the interaction of level i of
factor A, level j of factor B and level k of factor C;
and $e_{ijkl}$ is a component specific to the score of person l in level i
of factor A, level j of factor B and level k of factor C.

Generally i runs from 1 to a, j from 1 to b, and k from 1 to c (in the
experiment just described a = 2, b = 2 and c = 3), and with an equal
number of scores n in each cell l runs from 1 to n. The nine contributions
to the score $x_{ijkl}$ are all independent of each other, and the As, Bs,
Cs, (AB)s, (AC)s, (BC)s, (ABC)s and es are regarded as being drawn from
normally distributed populations with means of zero and variances of
$\sigma_A^2$, $\sigma_B^2$, $\sigma_C^2$, $\sigma_{AB}^2$, $\sigma_{AC}^2$,
$\sigma_{BC}^2$, $\sigma_{ABC}^2$ and $\sigma^2$ respectively (see p. 96).

The components analysis for the case of all three main effects being
fixed is shown in table 6.4. The mean-square expectations are a natural
extension of those for the two-factor experiment (table 5.5 i). With each
of A, B and C fixed, there is no sampling variation from the
cross-classifications of A, B and C to influence the cross-classifications
taken two at a time (i.e. the first-order interactions) and no sampling
variation from any of the cross-classifications to influence the main
effects. Notice, too, the balance between subscripts and coefficients: the
total number, subscripts plus coefficients, remains the same; and if a
particular letter (A say) does not appear as a subscript, the corresponding
lower-case letter (a) appears as a coefficient. It is apparent from table 6.4
that the within-cells mean square is the appropriate error term for testing
the significance of all the other effects.


Table 6.4 Components analysis for a three-factor
experiment, all three main effects being fixed

Source of         Degrees of              Mean-square
variation         freedom*                expectation

A                 a - 1                   $\sigma^2 + nbc\sigma_A^2$
B                 b - 1                   $\sigma^2 + nac\sigma_B^2$
C                 c - 1                   $\sigma^2 + nab\sigma_C^2$
A × B             (a - 1)(b - 1)          $\sigma^2 + nc\sigma_{AB}^2$
A × C             (a - 1)(c - 1)          $\sigma^2 + nb\sigma_{AC}^2$
B × C             (b - 1)(c - 1)          $\sigma^2 + na\sigma_{BC}^2$
A × B × C         (a - 1)(b - 1)(c - 1)   $\sigma^2 + n\sigma_{ABC}^2$
Within cells      abc(n - 1)              $\sigma^2$

* These are written for a levels of A, b levels of B, c levels of C, and for
n scores in each of the abc groups.


We often have to employ a three-way analysis when one of the effects
is random. This would be the case if a two-factor experiment (involving
methods and levels of intelligence as main effects, for instance) were
replicated in a number of schools. Schools would then enter into a
three-way analysis as a random effect. Another possibility, though one far
less likely to arise in educational investigations, is that two of the three
main effects are random. To write out the set of mean-square expectations
for each of these cases, it is best to begin with the mean-square
expectations when all three main effects are random. These are given in
table 6.5. In contrast to table 6.4, we see that the component
$n\sigma_{ABC}^2$ enters into all the first-order interactions, and that each
of the main effects, too, contains $n\sigma_{ABC}^2$ as well as the
components from the two first-order interactions in which it is involved.
The balance between subscripts and coefficients remains as before.

Suppose now that one of the effects, C say, is fixed. Certain of the
components in the mean-square expectations then have to be deleted.
Schultz (1955) provides a convenient rule for doing this. The rule may be
stated as follows:


1. Retain the last component in each line, the component designating 
the particular effect under consideration. 

2. Retain also the first component in each line. 

3. Of the remaining components delete those with any subscript 
representing a fixed effect other than the particular effect under 
consideration. 


Table 6.5 Mean-square expectations for a three-
factor experiment, all three main effects being random

Source of
variation       Mean-square expectation*

A               $\sigma^2 + n\sigma_{ABC}^2 + nc\sigma_{AB}^2 + nb\sigma_{AC}^2 + nbc\sigma_A^2$
B               $\sigma^2 + n\sigma_{ABC}^2 + nc\sigma_{AB}^2 + na\sigma_{BC}^2 + nac\sigma_B^2$
C               $\sigma^2 + n\sigma_{ABC}^2 + nb\sigma_{AC}^2 + na\sigma_{BC}^2 + nab\sigma_C^2$
A × B           $\sigma^2 + n\sigma_{ABC}^2 + nc\sigma_{AB}^2$
A × C           $\sigma^2 + n\sigma_{ABC}^2 + nb\sigma_{AC}^2$
B × C           $\sigma^2 + n\sigma_{ABC}^2 + na\sigma_{BC}^2$
A × B × C       $\sigma^2 + n\sigma_{ABC}^2$
Within cells    $\sigma^2$

* These are written for a levels of A, b levels of B, c levels of C, and for
n scores in each of the abc groups.


Consider line A in table 6.5. The component $n\sigma_{ABC}^2$ contains
subscript C, which represents a fixed effect and which is not the effect
under consideration (A); so this component must be deleted. So, too, must
the component $nb\sigma_{AC}^2$ in this line. Similarly, in line B the
components $n\sigma_{ABC}^2$ and $na\sigma_{BC}^2$ must be deleted. In line
C, on the other hand, the component $n\sigma_{ABC}^2$ has to be retained,
since C is now the effect under consideration. No component in this line is
affected by the rule. The only other component to be deleted is the
component $n\sigma_{ABC}^2$ in the line A × B. The resulting mean-square
expectations are set out in table 6.6. We see that the A × B interaction has
now to be tested for significance against the within-cells mean square; that
the A × C and B × C interactions have to be tested against the A × B × C
mean square; and that the A and B effects have to be tested against the
A × B mean square. No error term for testing the significance of C is
provided. An approximate method, based on adding together the C and
A × B × C mean squares for comparison with the sum of the A × C and B × C
mean squares, is suggested by Snedecor (1956); with $\sigma_C^2$ zero, the
two sums have the same expectation in table 6.6.


Table 6.6 Mean-square expectations for a three-
factor experiment, mixed model, A and B being
random and C fixed

Source of
variation       Mean-square expectation*

A               $\sigma^2 + nc\sigma_{AB}^2 + nbc\sigma_A^2$
B               $\sigma^2 + nc\sigma_{AB}^2 + nac\sigma_B^2$
C               $\sigma^2 + n\sigma_{ABC}^2 + nb\sigma_{AC}^2 + na\sigma_{BC}^2 + nab\sigma_C^2$
A × B           $\sigma^2 + nc\sigma_{AB}^2$
A × C           $\sigma^2 + n\sigma_{ABC}^2 + nb\sigma_{AC}^2$
B × C           $\sigma^2 + n\sigma_{ABC}^2 + na\sigma_{BC}^2$
A × B × C       $\sigma^2 + n\sigma_{ABC}^2$
Within cells    $\sigma^2$

* These are written for a levels of A, b levels of B, c levels of C, and for
n scores in each of the abc groups.


If, in addition to C, B is also fixed, further deletions must be made.
In line A of table 6.6, for instance, the component $nc\sigma_{AB}^2$
contains subscript B and so has to be deleted. The reader will have no
difficulty in showing that the mean-square expectations will then be as
given in table 6.7. We see that the A and A × C effects, as well as those of
A × B and A × B × C, have now to be tested for significance against the
within-cells mean square; that the B × C interaction still has to be tested
against the A × B × C mean square; that the B effect is still tested against
the A × B mean square; and that C can now be tested against the A × C mean
square. If, finally, the A effect is also fixed, further application of
Schultz’s rule would yield the mean-square expectations already shown in
table 6.4.
Table 6.7 Mean-square expectations
for a three-factor experiment, mixed
model, A being random and B and C fixed

Source of       Mean-square
variation       expectation*

A               $\sigma^2 + nbc\sigma_A^2$
B               $\sigma^2 + nc\sigma_{AB}^2 + nac\sigma_B^2$
C               $\sigma^2 + nb\sigma_{AC}^2 + nab\sigma_C^2$
A × B           $\sigma^2 + nc\sigma_{AB}^2$
A × C           $\sigma^2 + nb\sigma_{AC}^2$
B × C           $\sigma^2 + n\sigma_{ABC}^2 + na\sigma_{BC}^2$
A × B × C       $\sigma^2 + n\sigma_{ABC}^2$
Within cells    $\sigma^2$

* These are written for a levels of A, b levels of B, c levels of C, and for
n scores in each of the abc groups.
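
The deletion rule is mechanical enough to be written as a short program. The Python sketch below is one possible reading of the rule for three factors, not a published implementation: the component strings such as `nc*var(AB)` and all function names are invented here for illustration. It builds the all-random expectations, deletes components according to which factors are declared fixed, and then looks for a line whose expectation matches everything but the last component of the effect being tested.

```python
from itertools import combinations

FACTORS = ("A", "B", "C")
EFFECTS = [frozenset(c) for r in (1, 2, 3) for c in combinations(FACTORS, r)]


def label(effect):
    return "x".join(sorted(effect))


def component(subscript):
    """E.g. subscript {A, B} gives 'nc*var(AB)': a missing letter becomes a coefficient."""
    coeff = "n" + "".join(f.lower() for f in FACTORS if f not in subscript)
    name = "".join(sorted(subscript))
    return f"{coeff}*var({name})"


def expectation(effect, fixed=""):
    """Components of the mean-square expectation for `effect`."""
    effect, fixed = frozenset(effect), set(fixed)
    terms = ["var(e)"]                      # first component: always retained
    wider = sorted((s for s in EFFECTS if effect < s), key=len, reverse=True)
    for s in wider:
        # Delete if the subscript holds a fixed factor other than the
        # factors of the effect under consideration.
        if not (s - effect) & fixed:
            terms.append(component(s))
    terms.append(component(effect))         # last component: always retained
    return terms


def error_term(effect, fixed=""):
    """Line providing the error term for `effect`, or None if there is none."""
    target = expectation(effect, fixed)[:-1]
    if target == ["var(e)"]:
        return "Within cells"
    for other in EFFECTS:
        if expectation(other, fixed) == target:
            return label(other)
    return None


for fixed in ("", "C", "BC", "ABC"):
    print(f"fixed effects: {fixed or 'none'}")
    for effect in EFFECTS:
        line = " + ".join(expectation(effect, fixed))
        print(f"  {label(effect):>6}: {line}   [error: {error_term(effect, fixed)}]")
```

With `fixed="C"` the sketch reports no exact error term for C, and with all three factors fixed every effect is referred to the within-cells line, in agreement with the tables above.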


6.4 Orthogonal comparisons 


Each of the sources of variation separated out in table 6.3 may be derived
from an orthogonal comparison of the various cross-classifications. The
variation represented in the first seven lines of the table (accounting for
11 degrees of freedom) may, in fact, be derived from eleven independent
orthogonal comparisons. The total scores in each of the cross-classifications
of A, B and C (taken from table 6.1) are set out again in table 6.8, at the
head of the columns giving the $\lambda$ coefficients for the comparisons
(see section 3.5). We note that the sum of the coefficients in any one row
is zero, as is also the sum of the products of corresponding coefficients in
any two rows. Multiplying each coefficient by the score at the head of the
column and then adding for each row gives the comparison sums (c) shown on
the right. For the rows A and B these c sums could have been obtained from
the main-effects totals in table 6.1, i.e. $A_2 - A_1 = -14$ and
$B_2 - B_1 = -4$. Two rows have been written for the main effect C. This is
because C has 2 degrees of freedom, and these have been resolved
(arbitrarily) into (a) 1 degree of freedom for the difference between the
level $C_1$ and the average of the levels $C_2$ and $C_3$, and (b) 1 degree
of freedom for the difference between $C_2$ and $C_3$.
Table 6.8 Orthogonal comparisons for the 2 × 2 × 3 experiment (table 6.1)

                          Total cross-classification scores

                  A1B1C1  A1B1C2  A1B1C3  A1B2C1  A1B2C2  A1B2C3  A2B1C1
Comparison          111     102      94     125     115     107     119

A                    -1      -1      -1      -1      -1      -1       1
B                    -1      -1      -1       1       1       1      -1
C          (i)        2      -1      -1       2      -1      -1       2
           (ii)       0       1      -1       0       1      -1       0
A × B                 1       1       1      -1      -1      -1      -1
A × C      (i)       -2       1       1      -2       1       1       2
           (ii)       0      -1       1       0      -1       1       0
B × C      (i)       -2       1       1       2      -1      -1      -2
           (ii)       0      -1       1       0       1      -1       0
A × B × C  (i)        2      -1      -1      -2       1       1      -2
           (ii)       0       1      -1       0      -1       1       0

Table 6.2 i (reproduced)
              A1      A2
      B2     347     298
      B1     307     342

The coefficients for the interactions have been obtained by
multiplication. For the A × B comparison, for instance, corresponding
coefficients in the A and B rows have been multiplied, and for each of the
A × B × C comparisons corresponding coefficients in the A, B and the
particular C row have been multiplied. The comparison sum for A × B could
have been obtained from table 6.2 i, reproduced above for convenience.
Table 6.8 (continued)

                  Total cross-classification scores               Sum of squares
                                                                    c²/(nΣλ²)
                  A2B1C2  A2B1C3  A2B2C1  A2B2C2  A2B2C3   Sum
Comparison          114     109     109      99      90    (c)   Σλ²   Separate  Combined

A                     1       1       1       1       1    -14    12      3.27      3.27
B                    -1      -1       1       1       1     -4    12      0.27      0.27
C          (i)       -1      -1       2      -1      -1     98    24     80.03
           (ii)       1      -1       0       1      -1     30     8     22.50    102.53
A × B                -1      -1       1       1       1    -84    12    117.60    117.60
A × C      (i)       -1      -1       2      -1      -1    -10    24      0.83
           (ii)       1      -1       0       1      -1     -2     8      0.10      0.93
B × C      (i)        1       1       2      -1      -1     16    24      2.13
           (ii)      -1       1       0       1      -1      4     8      0.40      2.53
A × B × C  (i)        1       1       2      -1      -1     12    24      1.20
           (ii)      -1       1       0       1      -1      4     8      0.40      1.60


The difference $B_2 - B_1$ at level $A_2$ is $298 - 342 = -44$, and the
same difference at level $A_1$ is $347 - 307 = 40$; so the change in the
difference is $-44 - 40 = -84$, as given in table 6.8.

The sum of squares for each comparison is evaluated by the formula

$$\frac{c^2}{n\sum\lambda^2}$$

$\sum\lambda^2$ being the sum of the squares of the coefficients, and n the
number of scores in each group, 5. The sums for the two separate
comparisons of C, A × C, B × C and A × B × C have been combined. All the
sums of squares are seen to agree exactly with those previously obtained
(table 6.3), except for a small discrepancy in one instance (0.01) due to
rounding error.
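
The construction of the eleven comparisons can be retraced with a short Python sketch. It assumes the twelve cell totals in the order A1B1C1, A1B1C2, ..., A2B2C3, codes the lower level of A and of B as -1 and the upper as +1, and uses the two C comparisons described above; the row names and helper functions are illustrative only.

```python
from itertools import product

# Cell totals in the order A1B1C1, A1B1C2, A1B1C3, A1B2C1, ..., A2B2C3.
totals = [111, 102, 94, 125, 115, 107, 119, 114, 109, 109, 99, 90]
n = 5  # scores in each cell

cells = list(product(range(2), range(2), range(3)))   # (i, j, k) per column
two_level = [-1, 1]
c_rows = {"i": [2, -1, -1], "ii": [0, 1, -1]}


def times(u, v):
    """Multiply corresponding coefficients of two rows."""
    return [x * y for x, y in zip(u, v)]


rows = {
    "A": [two_level[i] for i, j, k in cells],
    "B": [two_level[j] for i, j, k in cells],
}
for tag, coefs in c_rows.items():
    rows[f"C({tag})"] = [coefs[k] for i, j, k in cells]
rows["AxB"] = times(rows["A"], rows["B"])
for tag in c_rows:
    rows[f"AxC({tag})"] = times(rows["A"], rows[f"C({tag})"])
    rows[f"BxC({tag})"] = times(rows["B"], rows[f"C({tag})"])
    rows[f"AxBxC({tag})"] = times(rows["AxB"], rows[f"C({tag})"])


def is_orthogonal_set(rows):
    """Every row sums to zero and every pair of rows has zero sum of products."""
    names = list(rows)
    sums_zero = all(sum(rows[r]) == 0 for r in names)
    products_zero = all(
        sum(times(rows[p], rows[q])) == 0
        for idx, p in enumerate(names) for q in names[idx + 1:]
    )
    return sums_zero and products_zero


def comparison(coefs):
    """Comparison sum c, sum of squared coefficients, and sum of squares."""
    c = sum(l * t for l, t in zip(coefs, totals))
    sum_sq = sum(l * l for l in coefs)
    return c, sum_sq, c * c / (n * sum_sq)


results = {name: comparison(coefs) for name, coefs in rows.items()}
print("orthogonal:", is_orthogonal_set(rows))
for name, (c, sum_sq, ss) in results.items():
    print(f"{name:>10}: c = {c:4d}  sum of squares = {ss:7.2f}")
```

Adding the eleven sums of squares recovers the between-cells sum of squares, which is one way of seeing that the comparisons account for all 11 degrees of freedom.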

Table 6.9 Orthogonal comparisons for a 2 × 2 × 2 experiment

                          Total cross-classification scores

              A1B1C1  A1B1C2  A1B2C1  A1B2C2  A2B1C1  A2B1C2  A2B2C1  A2B2C2   Sum
Comparison    a1b1c1  a1b1c2  a1b2c1  a1b2c2  a2b1c1  a2b1c2  a2b2c1  a2b2c2   (c)

A               -1      -1      -1      -1       1       1       1       1
B               -1      -1       1       1      -1      -1       1       1
C               -1       1      -1       1      -1       1      -1       1
A × B            1       1      -1      -1      -1      -1       1       1
A × C            1      -1       1      -1      -1       1      -1       1
B × C            1      -1      -1       1       1      -1      -1       1
A × B × C       -1       1       1      -1       1      -1      -1       1

The orthogonal comparisons for the case of A, B and C all having two
levels is of special interest, in that each comparison then represents one
of the between-cells sources of variance in the analysis of variance. The
comparisons are set out in table 6.9. If the total scores for the various
cross-classifications are shown by the appropriate lower-case letters, then
the comparison sum for A appears as

$$c_A = -a_1b_1c_1 - a_1b_1c_2 - a_1b_2c_1 - a_1b_2c_2 + a_2b_1c_1 + a_2b_1c_2 + a_2b_2c_1 + a_2b_2c_2$$

This we can represent formally as

$$c_A = (a_2 - a_1)(b_2 + b_1)(c_2 + c_1)$$

since if this expression were expanded algebraically, the former expression
would be obtained.


In the same way, the comparison sums for B and C may be represented as

$$c_B = (a_2 + a_1)(b_2 - b_1)(c_2 + c_1)$$

$$c_C = (a_2 + a_1)(b_2 + b_1)(c_2 - c_1)$$


The comparison sums for the interactions may also be represented in
this manner. The A × B sum, for instance, is

$$c_{A \times B} = a_1b_1c_1 + a_1b_1c_2 - a_1b_2c_1 - a_1b_2c_2 - a_2b_1c_1 - a_2b_1c_2 + a_2b_2c_1 + a_2b_2c_2$$

which may be represented by

$$c_{A \times B} = (a_2 - a_1)(b_2 - b_1)(c_2 + c_1)$$

Similarly we may write

$$c_{A \times C} = (a_2 - a_1)(b_2 + b_1)(c_2 - c_1)$$

and

$$c_{B \times C} = (a_2 + a_1)(b_2 - b_1)(c_2 - c_1)$$

Finally, the comparison sum for the second-order interaction, which is
given by

$$c_{A \times B \times C} = -a_1b_1c_1 + a_1b_1c_2 + a_1b_2c_1 - a_1b_2c_2 + a_2b_1c_1 - a_2b_1c_2 - a_2b_2c_1 + a_2b_2c_2$$

may be represented by

$$c_{A \times B \times C} = (a_2 - a_1)(b_2 - b_1)(c_2 - c_1)$$

Similar expressions may be written for main and interaction effects of
experiments with more than three factors. Discussion of this now follows.
A further simplification adopted by many writers is to represent the
upper level of the factor by the lower-case letter and the lower level by the
absence of the letter. Thus, the cross-classification sum for $A_2B_2C_2$
would be written as abc, the sum for $A_1B_2C_2$ as bc, the sum for
$A_1B_1C_2$ as c and the sum for $A_1B_1C_1$ simply as (1). The comparison
sum, A × B × C, for instance, would then appear as
$-(1) + c + b - bc + a - ac - ab + abc$,
which may be represented by $(a - 1)(b - 1)(c - 1)$.
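
The sign attached to each cell sum in such a product can be generated mechanically. The sketch below is an illustration under the convention just described (a letter present means the upper level, absent the lower, and (1) means all factors at the lower level); the function name is invented. A factor belonging to the effect contributes a bracket of the form (x - 1), and any other factor a bracket of the form (x + 1).

```python
from itertools import product


def comparison_signs(effect, factors="abc"):
    """Sign of each cell sum in the comparison for `effect`.

    `effect` is a string of lower-case factor letters, e.g. "ab" for A x B.
    """
    signs = {}
    for present in product((0, 1), repeat=len(factors)):
        cell = "".join(f for f, p in zip(factors, present) if p) or "(1)"
        sign = 1
        for f, p in zip(factors, present):
            # In a bracket (x - 1) the lower level carries a minus sign.
            if f in effect and not p:
                sign = -sign
        signs[cell] = sign
    return signs


for effect in ("a", "ab", "abc"):
    signs = comparison_signs(effect)
    terms = " ".join(f"{'+' if s > 0 else '-'}{cell}" for cell, s in signs.items())
    print(f"{effect:>3}: {terms}")
```

For the effect "abc" the signs reproduce the expansion of (a - 1)(b - 1)(c - 1), and the same function extends to four or more factors by lengthening the `factors` string.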


6.5 Higher dimensional designs 


When the number of factors exceeds three, the number and variety of the
different effects increases rapidly. With four factors (A, B, C and D) we
would have in addition to the four main effects six first-order interactions
(A × B, A × C, A × D, B × C, B × D and C × D), four second-order
interactions (A × B × C, A × B × D, A × C × D and B × C × D) and one
third-order interaction (A × B × C × D). Similarly, with five factors we
would have five main effects, ten first-order interactions, ten second-order
interactions, five third-order interactions and one fourth-order interaction.
The degrees of freedom for these effects follow the same pattern as before.
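
These counts are simply numbers of combinations, as a brief Python sketch shows. The function name is illustrative; order 0 is used here to stand for the main effects.

```python
from math import comb


def effects_by_order(num_factors):
    """Number of effects of each order: order 0 = main effects,
    order 1 = first-order interactions, and so on."""
    return {order: comb(num_factors, order + 1) for order in range(num_factors)}


for k in (3, 4, 5):
    counts = effects_by_order(k)
    print(k, "factors:", counts, "total", sum(counts.values()))
```

With k factors the total number of effects is one less than 2 raised to the power k, which is why the labour of a complete analysis grows so quickly.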

Each higher-order interaction, if statistically significant, will limit the
scope of any conclusion drawn from a lower-order interaction (involving
the same factors), just as a significant first-order interaction will limit
the scope of a related main effect. Higher-order interactions may be
interpreted in a number of ways. The third-order A × B × C × D interaction,
for instance, may be regarded as the interaction of A × B × C with D, i.e.
as differences in the interaction A × B × C at the various levels of D, or
as the interaction of A × B × D with C, or as the interaction of A × C × D
with B, or as the interaction of B × C × D with A. It may also be regarded
as the interaction between the two first-order interactions A × B and C × D
(or again, between A × C and B × D, etc.), i.e. as differences in the A × B
interactions as different levels of both C and D are taken in turn.

An orderly arrangement of the data can be made without difficulty,
though the actual computations become increasingly laborious. Take, for
example, a third-order interaction (A × B × C × D) in a five-factor design.
The sum of squares of the cell totals from the appropriate table is obtained
(the table giving the cross-classifications of the four factors A, B, C and
D, the total score in each cell being the sum for all levels of E) and from
this sum is subtracted the sum of squares for each of the four main effects
involved (A, B, C and D), the sum of squares for each of the six first-order
interactions involved (A × B, A × C, A × D, B × C, B × D and C × D), and
the sum of squares for each of the four second-order interactions involved
(A × B × C, A × B × D, A × C × D and B × C × D), and finally the correction
term.*

Because of the arithmetical labour of a complete analysis, a frequent
practice has been to calculate sums of squares for the main effects and
first- and second-order interactions only, taking the mean square from the
residual sum of squares as the error term. This practice assumes that all the
higher-order interactions are insignificant, or, more precisely, that all the
population variances of type $\sigma^2_{ABCD}$, $\sigma^2_{ABCDE}$, etc., are zero. If this happens to
be true, and if the model has only fixed effects, then the mean square from
each of the unseparated interactions estimates the same population variance,
$\sigma^2$. By taking the mean square from the residual sum (a composite of all
these interaction sums and the within-cells sum) one is in effect pooling all
the separate sums to get a better estimate, i.e. one based on a larger number
of degrees of freedom. If, on the other hand, one or more of the population
variances $\sigma^2_{ABCD}$, $\sigma^2_{ABCDE}$, etc., is not zero, the residual mean square can be
expected to over-estimate $\sigma^2$ (for a fixed-effects model) and significance
tests based on it will fail too frequently to detect real differences. Unless,
therefore, there is reason to believe (from previous experiments) that all
third- and higher-order interactions are in fact negligible, this practice is
not recommended.

For a model with one or more random effects, the choice of error term
for any particular effect must be worked out systematically by writing
down the full components analysis (i.e. an analysis for a model with only
random effects) and then deleting appropriate components in accordance
with Schultz's rule. Interactions between the random and fixed effects are
generally to be expected, and the use of a residual mean square derived
from an incomplete analysis should again be discouraged.

Complete factorial designs with four or more factors are, in fact, seldom
employed in educational research. This is because with a large number of
factors, and also with a large number of levels for each factor, the
experiment becomes unwieldy. Even if we tested only five factors, each with only
two levels, we would have $2^5$ or thirty-two cross-classifications in all, i.e.
thirty-two separate groups for testing. The administrative and other
practical difficulties involved might then be prohibitive.


* For variants of this method of calculation, which may occasionally be
useful, see Edwards and Horst (1950).
An obvious advantage would be to have all levels of some of the factors
administered to the same group of testees, if this is possible. (For many
factors it is not: different methods of teaching, for instance, inevitably
necessitate different groups.) Designs which allow all levels of some of the
factors to be administered to the same testees are described in chapter 7.
When the same group cannot be used for more than one level of a factor,
an ‘incomplete’ factorial design may sometimes be helpful. This is a
design in which some of the information possible from the complete
factorial design is sacrificed. Usually the information sacrificed will be a
higher-order interaction expected, from previous experiments in the field,
to be negligible (factorial experiments with confounding), though it could
also involve a main effect (the split-plot design). Detailed accounts are
provided by Cochran and Cox (1957).*

* The basic principle is the use of blocks, similar to that of a randomized-
blocks design (figure 4), each block containing only some of the possible
cross-classifications, but with the differences between blocks being eliminated from
statistical error as before. Thus, suppose that two such blocks for a 2 × 2 × 2
experiment are (in the notation suggested at the end of section 6.4) as follows:

BLOCK 1:  abc   a    b    c

BLOCK 2:  ab    ac   bc   (1)

Two of the plots in each block contain the upper level of A, and two the lower
level of A. Hence the A effect is unaffected by the differences between the blocks.
Similarly, we can verify (from table 6.9) that each of the B, C, A × B, A × C and
B × C effects is also unaffected by the difference between the blocks. The A × B × C
effect, however, is based on the comparison abc + a + b + c − ab − ac − bc − (1),
i.e. block 1 plots minus block 2 plots. The second-order interaction is
confounded (inseparable in its effect) with the difference between the blocks.
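
The confounding described in this footnote can be verified by summing the signs of each comparison within a block. The sketch below is illustrative only (the names `sign`, `block_imbalance` and `confounded` are assumptions, not the author's); an effect is free of the block difference when its signs sum to zero within each block.

```python
from itertools import combinations

FACTORS = "abc"
block1 = ["abc", "a", "b", "c"]
block2 = ["ab", "ac", "bc", "(1)"]


def sign(effect, plot):
    """Sign of a plot in the comparison for an effect such as 'ab' (A x B)."""
    s = 1
    for factor in effect:
        # Upper level of the factor if its letter is present in the plot label.
        s *= 1 if factor in plot else -1
    return s


def block_imbalance(effect):
    """Sum of signs in block 1 minus sum of signs in block 2."""
    return sum(sign(effect, p) for p in block1) - sum(sign(effect, p) for p in block2)


effects = ["".join(c) for r in (1, 2, 3) for c in combinations(FACTORS, r)]
confounded = [e for e in effects if block_imbalance(e) != 0]
print(confounded)
```

Only the second-order interaction is reported as confounded with blocks; the other six comparisons are balanced within each block.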
Chapter 7 Designs with Nesting and Crossing


7.1 A single crossing of the nested factor 


To illustrate a basic design involving both nesting and crossing, a study in
the field of programmed learning (Lewis and Gregson 1965) may be taken.
The study is concerned with the effect of frame size on learning from a
linear programme. It is generally accepted that a sequence of fairly small
steps is necessary for the mastery of complex material, yet differing views
may be held as to how much material can best be presented at the same
time. The same basic programme (one dealing with the history of number)
was therefore presented in three versions according to the size of frame,
large, medium and small. Pupils drawn from three ability groups (relatively
high, medium and low I.Q.s) were selected, equal numbers from each
group being allocated at random to each programme frame size. After the
programme was worked through, an immediate test was given. The same
test was given again just over a week later, and also on a third occasion a
month later. Scores were therefore obtained from three I.Q. groups, with
respect to three frame sizes on three occasions. The scores are set out in
table 7.1.

We see that the experiment resembles a 3 × 3 × 3 factorial experiment,
frame size, I.Q. and occasions being the three factors. One
difference, however, should be noted between a comparison of the different
levels of frame size and I.Q. on the one hand, and of the different levels of
occasions on the other. Different levels of frame size and I.Q. contain the
scores of different pupils, while all the occasions contain the scores of the
same pupils. Pupils, in other words, appears as an additional factor, one
which is nested within frame size and I.Q. but which, within each of the
cross-classifications of frame size and I.Q., is crossed with occasions. The
experiment has therefore been described as a four-way experiment in
table 7.1, since each score may be classified in four ways, i.e. as belonging
to a particular frame size (A), I.Q. group (B), occasion (C), and pupil (P).


Table 7.1 Scores of nine groups in a four-way experiment

[Body of table (162 individual scores with cell totals) not recoverable from this copy. Marginal totals used in the calculation: A1 4695, A2 4576, A3 4518; B1 4031, B2 4512, B3 5246; C1 5241, C2 4306, C3 4242.]

Overall total = 13,789

N.B. The totals for each of the twenty-seven ABC cross-classifications are shown circled.
The totals for each of the nine AB cross-classifications are shown boxed.
Within each of the nine AB cross-classifications the scores in the same column belong to the same pupil.

Key: A1 - Small frame size    A2 - Medium frame size    A3 - Large frame size
     B1 - Low I.Q.s           B2 - Average I.Q.s        B3 - High I.Q.s
     C1 - Occasion 1          C2 - Occasion 2           C3 - Occasion 3


The analysis of the differences between cells (a cell being one of the
ABC cross-classifications, as before) follows the same pattern as in
section 6.2, the sums of squares for the main effects of frame size, I.Q. and
occasions being separated as well as those for the three first-order and one
second-order interaction between these effects. In addition, however, a
sum of squares for the differences between pupils must now be removed
from the variation within the A × B cross-classifications,* leaving a residual
sum representing the interaction of pupils with occasions, again within the
A × B cross-classifications. The calculation is shown below. The first ten
steps parallel those of the analysis in section 6.2.


1. Total sum of squares (sum of 162 terms)
$$(117^2 + 96^2 + \cdots + 61^2) - \frac{13{,}789 \times 13{,}789}{162} = 1{,}239{,}847 - 1{,}173{,}682.23 = 66{,}164.77$$

2. Between-cells sum of squares (sum of 27 terms)
$$\left(\frac{646^2}{6} + \frac{647^2}{6} + \cdots + \frac{443^2}{6}\right) - \frac{13{,}789^2}{162} = 1{,}202{,}121.17 - 1{,}173{,}682.23 = 28{,}438.94$$

3. Within-cells sum of squares
$$66{,}164.77 - 28{,}438.94 = 37{,}725.83$$

4. Between A levels sum of squares
$$\frac{4695^2}{54} + \frac{4576^2}{54} + \frac{4518^2}{54} - \frac{13{,}789^2}{162} = 1{,}173{,}983.80 - 1{,}173{,}682.23 = 301.57$$

5. Between B levels sum of squares
$$\frac{4031^2}{54} + \frac{4512^2}{54} + \frac{5246^2}{54} - \frac{13{,}789^2}{162} = 1{,}187{,}548.54 - 1{,}173{,}682.23 = 13{,}866.31$$


* This was not possible in the experiment of section 6.2, since there each cell
contained the scores of independent groups, i.e. of different children.
6. Between C levels sum of squares
$$\frac{5241^2}{54} + \frac{4306^2}{54} + \frac{4242^2}{54} - \frac{13{,}789^2}{162} = 1{,}185{,}264.46 - 1{,}173{,}682.23 = 11{,}582.23$$

7. A × B interaction sum of squares (sum of 9 terms, one for each A × B total of table 7.1, written $T_{AB}$)
$$\sum \frac{T_{AB}^2}{18} - 301.57 - 13{,}866.31 - \frac{13{,}789^2}{162} = 1{,}188{,}296.94 - 301.57 - 13{,}866.31 - 1{,}173{,}682.23 = 446.83$$

8. A × C interaction sum of squares (see table 7.2 i; sum of 9 terms)
$$\left(\frac{1849^2}{18} + \frac{1728^2}{18} + \cdots + \frac{1439^2}{18}\right) - 301.57 - 11{,}582.23 - \frac{13{,}789^2}{162} = 1{,}186{,}349.83 - 301.57 - 11{,}582.23 - 1{,}173{,}682.23 = 783.80$$

9. B × C interaction sum of squares (see table 7.2 ii; sum of 9 terms)
$$\left(\frac{1918^2}{18} + \frac{1682^2}{18} + \cdots + \frac{1646^2}{18}\right) - 13{,}866.31 - 11{,}582.23 - \frac{13{,}789^2}{162} = 1{,}199{,}965.39 - 13{,}866.31 - 11{,}582.23 - 1{,}173{,}682.23 = 834.62$$

10. A × B × C interaction sum of squares
$$28{,}438.94 - 301.57 - 13{,}866.31 - 11{,}582.23 - 446.83 - 783.80 - 834.62 = 623.58$$
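
The main-effect sums of squares in steps 4 to 6, and the second-order interaction obtained by subtraction in step 10, can be reproduced from the marginal totals. This is a minimal sketch using the figures quoted in the steps; the function name `main_effect_ss` is an illustrative assumption.

```python
def main_effect_ss(level_totals, scores_per_level, correction):
    """Sum of squared level totals over scores per level, less the correction term."""
    return sum(t * t for t in level_totals) / scores_per_level - correction


grand_total, n_scores = 13789, 162
correction = grand_total ** 2 / n_scores

ss_a = main_effect_ss([4695, 4576, 4518], 54, correction)  # frame size
ss_b = main_effect_ss([4031, 4512, 5246], 54, correction)  # I.Q.
ss_c = main_effect_ss([5241, 4306, 4242], 54, correction)  # occasions

# Step 10: the A x B x C sum is what remains of the between-cells sum.
ss_between_cells = 28438.94
ss_ab, ss_ac, ss_bc = 446.83, 783.80, 834.62
ss_abc = ss_between_cells - ss_a - ss_b - ss_c - ss_ab - ss_ac - ss_bc

print(round(ss_a, 2), round(ss_b, 2), round(ss_c, 2), round(ss_abc, 2))
```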
Table 7.2 Total scores for the A × C and B × C cross-classifications in
table 7.1

[Body of table not recoverable from this copy.]

N.B. Totals for the A × B cross-classifications are recorded in table 7.1.

Key: A1 - Small frame size    A2 - Medium frame size    A3 - Large frame size
     B1 - Low I.Q.s           B2 - Average I.Q.s        B3 - High I.Q.s
     C1 - Occasion 1          C2 - Occasion 2           C3 - Occasion 3
11. Between-pupils (P) (within A × B) sum of squares
$$\left(\frac{334^2}{3} + \frac{274^2}{3} + \cdots + \frac{272^2}{3} - \frac{1744^2}{18}\right)^{*} + \left(\frac{311^2}{3} + \frac{308^2}{3} + \cdots + \frac{250^2}{3} - \frac{1729^2}{18}\right) + \cdots = 30{,}672.20$$
(each bracket a sum of 6 terms less its own correction term, with similar sets of terms for each of the other seven A × B cross-classifications)

12. Residual sum of squares (P × C interaction, within A × B)
$$37{,}725.83 - 30{,}672.20 = 7{,}053.63$$

* Each of the numerators 334, 274, ..., 272 is the sum of three scores;
1744 is the sum of eighteen scores.
The first ten steps in the calculation are similar to those described for
a factorial experiment in chapter 6.* For the between-pupils sum of
squares, separate correction terms are subtracted for each of the A × B
cross-classifications. This sum is a measure of the pupil differences within
each of the A × B cross-classifications, and totalled for all of those
cross-classifications. As there are six pupils and therefore 5 degrees of freedom
in each cross-classification, the degrees of freedom for this sum are 45.


Table 7.3 Analysis of variance of the data in table 7.1

Source of variation              Sum of squares   Degrees of freedom   Mean square
A                                      301.57              2               150.78
B                                   13,866.31              2             6,933.15†
C                                   11,582.23              2             5,791.11†
A × B                                  446.83              4               111.71
A × C                                  783.80              4               195.95*
B × C                                  834.62              4               208.65*
A × B × C                              623.58              8                77.95
Pupils (P), within A × B            30,672.20             45               681.60
Residual (P × C), within A × B       7,053.63             90                78.37
Total                               66,164.77            161

* Significant at the 5-per-cent level.
† Significant at the 1-per-cent level.


In the same way, the residual sum of squares is a measure of the variation
within the A × B cross-classifications after the removal of
the pupil differences. As the only main effects are those of occasions and
pupils, and as both have already been removed, this residual sum may
therefore be described as the sum for the pupils × occasions interaction
within each of the frame size × I.Q. cross-classifications, and totalled for all
such cross-classifications. The sum has been obtained by subtracting the
between-pupils sum (30,672.20) from the within-cells sum (37,725.83).
The degrees of freedom may also be obtained in this way, i.e. 135 − 45 = 90.
(Alternatively, as there are six pupils (P) and three occasions (C), and
therefore 5 × 2 = 10 degrees of freedom for the P × C interaction within
each of the nine A × B cross-classifications, the degrees of freedom may
be obtained as 10 × 9 = 90.)

* The partitioning of the total sum of squares for the present experiment has
been presented in this way, as it is an extension of that described in chapter 6.
An alternative approach would be that of separating the total sum of squares
into sums between pupils and within pupils, A, B and A × B being subdivisions
of the former sum and C, A × C, B × C and A × B × C subdivisions of
the latter (see Lindquist 1956).
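
The degrees of freedom for every source of variation in this design follow from the numbers of levels alone. The sketch below is illustrative (the function name `nested_design_df` is an assumption); it uses a, b and c for the levels of frame size, I.Q. and occasions, and p for the number of pupils in each A × B cross-classification.

```python
def nested_design_df(a, b, c, p):
    """Degrees of freedom when P is nested within A x B and crossed with C."""
    return {
        "A": a - 1,
        "B": b - 1,
        "C": c - 1,
        "AxB": (a - 1) * (b - 1),
        "AxC": (a - 1) * (c - 1),
        "BxC": (b - 1) * (c - 1),
        "AxBxC": (a - 1) * (b - 1) * (c - 1),
        "P within AxB": a * b * (p - 1),
        "PxC within AxB": a * b * (p - 1) * (c - 1),
    }


df = nested_design_df(a=3, b=3, c=3, p=6)
total_df = sum(df.values())  # should equal the number of scores less one
print(df, total_df)
```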

The analysis of variance is set out in table 7.3. There are two error
terms, the mean square for pupils (P), and that for the residual (P × C,
within A × B). The mean square for pupils is the error term for all effects
based on sets of scores from different pupils, i.e. frame size (A), I.Q. (B)
and the A × B interaction. The residual mean square is the error term for all
other effects, i.e. those involving occasions (C), and therefore based on
sets of scores from the same pupils. The justification for this is postponed
until the next section.


Taking first the effects testable by the mean square for pupils, we have
the following F ratios:

for A, $F = \frac{150.78}{681.60} < 1$, so that the differences in frame size are
not statistically significant;

for B, $F = \frac{6{,}933.15}{681.60} = 10.17$, which for 2 and 45 degrees of freedom
is significant at the 1-per-cent level (statistical table 2B);

and for A × B, $F = \frac{111.71}{681.60} < 1$, so that the frame size × intelligence
interaction is not statistically significant.

For the effects testable by the residual mean square we have

for C, $F = \frac{5{,}791.11}{78.37} = 73.87$, which for 2 and 90 degrees of
freedom is significant at the 1-per-cent level (statistical table 2B);

for A × C, $F = \frac{195.95}{78.37} = 2.50$, which for 4 and 90 degrees of freedom
is significant at the 5-per-cent level (statistical table 2A);

for B × C, $F = \frac{208.65}{78.37} = 2.66$, which for 4 and 90 degrees of freedom
is significant at the 5-per-cent level (statistical table 2A);

and for A × B × C, $F = \frac{77.95}{78.37} < 1$, so that the frame size × intelligence ×
occasions interaction is not statistically significant.
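
The pairing of each effect with its error term can be set out explicitly. The following sketch recomputes the F ratios from the rounded mean squares of table 7.3; the dictionary names are illustrative assumptions, and ratios formed from rounded mean squares may differ slightly in the second decimal place from the printed values.

```python
mean_squares = {
    "A": 150.78, "B": 6933.15, "C": 5791.11,
    "AxB": 111.71, "AxC": 195.95, "BxC": 208.65, "AxBxC": 77.95,
    "P": 681.60,    # pupils, within A x B
    "PxC": 78.37,   # residual, within A x B
}

# Effects based on different pupils are tested against P;
# effects involving occasions (C) are tested against the residual.
error_term = {
    "A": "P", "B": "P", "AxB": "P",
    "C": "PxC", "AxC": "PxC", "BxC": "PxC", "AxBxC": "PxC",
}

f_ratios = {
    effect: mean_squares[effect] / mean_squares[err]
    for effect, err in error_term.items()
}
for effect, f in f_ratios.items():
    print(f"{effect:6s} F = {f:.2f} (error term: {error_term[effect]})")
```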


The total scores for the A × C and B × C cross-classifications (upon
which the two significant interactions are based) have been set out in table
7.2. In the A × C table we see that, while on the first occasion the scores
decrease sharply with increasing frame size, on the second this decrease is
only slight, and on the third the trend is actually reversed (though only to
a slight extent). Learning from the large frames appears at an advantage
only on the most delayed test. Clearly it is this difference in trend which
underlies the significance of the frame size × occasions interaction. Again,
since the second-order interaction (A × B × C) is not significant, this
difference in trend may be held to be the same at all three levels of I.Q.

An inspection of the B × C table shows that although the scores at both
the high and average I.Q. levels decrease as we proceed from occasion 1 to
occasion 3, the score at the low I.Q. level increases (slightly) on occasion 3, though
remaining well below that for occasion 1. Alternatively, viewing the column
trends, we may say that the decrease in score when proceeding to the
lowest I.Q.s is less pronounced on occasion 3. Again, from the insignificant
second-order interaction, this feature may be taken as fundamentally the
same for all three frame sizes. For both the A × C and B × C tables the
significance of the difference between any two scores in the same row or
column could well be tested in the manner previously described.

The differential influence of occasions on the differences between I.Q.
levels is the only qualification we need to make in interpreting the
significance of the main effect of I.Q. Overall the high I.Q. group does markedly
better than the average I.Q. group, and the average I.Q. group markedly
better than the low I.Q. group. (See the total scores recorded at the bottom
of table 7.1.)

The main effect of frame size, on the other hand, is far too small for 
statistical significance, though we see from the total scores that learning 
from the small frames appears more effective than from the medium 
frames, and learning from the medium frames more effective than from the 
large frames. As has been noted, the superiority of the small frames is most 
marked for occasion 1. Indeed, it is the differential effect of occasions on 
frame size which has been shown to be important. Similarly the expected 
and significant decrease in score on the later occasions needs to be qualified 
by the two significant interactions described. 


7.2 The model for a single crossing of the nested factor 


The model for the design of the programmed learning experiment described
may be written as

$$X_{ijkl} = M + A_i + B_j + C_k + (AB)_{ij} + (AC)_{ik} + (BC)_{jk} + (ABC)_{ijk} + P_{ijl} + (PC)_{ijkl}$$

The first eight terms are identical with those of the model for the three-
factor experiment of chapter 6 (section 6.3), with A now referring to frame
size, B to intelligence and C to occasion. $P_{ijl}$ is a component common to
all the scores of pupil l of level i of factor A and level j of factor B, while the
term $(PC)_{ijkl}$ (which could also be written more simply, though less
descriptively, as $e_{ijkl}$) is a component specific to the score of pupil l (again
of level i of factor A and level j of factor B) on occasion k. Note that the
subscript l occurs only with the accompanying subscripts i and j, thus
expressing the nesting of pupils within frame size and intelligence. In the
particular experiment described, i, j and k all run from 1 to 3, while l runs
from 1 to 6.
The components analysis is shown in table 7.4. The balance of
subscripts and coefficients in the components of the mean-square
expectations is maintained by using a dot to precede the factors within which the
factor pupils (P) is nested. $\sigma^2_{P.AB}$, therefore, denotes the variance of pupils
within the frame size and intelligence cross-classifications, and $\sigma^2_{PC.AB}$ the
variance of the interaction between pupils and occasions, again within the
frame size and intelligence cross-classifications. With this notation, we see
that the number of subscripts and coefficients is the same for all
components, and that if a particular factor does not appear as a subscript, the
corresponding lower-case letter appears as a coefficient.


Table 7.4 Components analysis for a four-factor experiment, one factor
being nested and singly crossed

Source of variation     Degrees of freedom*      Mean-square expectation†
A                       a − 1                    $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + pbc\sigma^2_A$
B                       b − 1                    $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + pac\sigma^2_B$
C                       c − 1                    $\sigma^2_{PC.AB} + pab\sigma^2_C$
A × B                   (a − 1)(b − 1)           $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + pc\sigma^2_{AB}$
A × C                   (a − 1)(c − 1)           $\sigma^2_{PC.AB} + pb\sigma^2_{AC}$
B × C                   (b − 1)(c − 1)           $\sigma^2_{PC.AB} + pa\sigma^2_{BC}$
A × B × C               (a − 1)(b − 1)(c − 1)    $\sigma^2_{PC.AB} + p\sigma^2_{ABC}$
P, within A × B         ab(p − 1)                $\sigma^2_{PC.AB} + c\sigma^2_{P.AB}$
P × C, within A × B     ab(p − 1)(c − 1)         $\sigma^2_{PC.AB}$

* These are written for a levels of A, b levels of B, c levels of C and p levels of P
within each of the ab cross-classifications of A and B.
† A, B and C are taken to be fixed effects, and P a random effect.


It is best to begin setting out the mean-square expectations from the
bottom upwards (i.e. starting with the basic variation P × C, within
A × B) on the assumption that all effects are random, deletions for
fixedness afterwards being made in accordance with Schultz's rule. One further
point, however, needs to be made respecting these deletions. A component
can qualify for deletion in the case of nesting only in respect of subscripts
preceding the dot. (The subscripts are referred to by Schultz (1955) as
'essential'.) A component is not deleted because of the fixedness of any
effect specified by a subscript after the dot. The components for the case of
all effects being random are shown at the top of p. 142, the subsequent
deletions for the fixedness of A, B and C being indicated by oblique lines.

With the deletions indicated, these mean-square expectations become
identical with those in table 7.4. It follows that the residual mean square
(from P × C, within A × B) is the correct error term for testing the
Table 7.5 Scores of two groups in a four-way experiment


Total (41-42) 


121412 714 
12 8 713 10 
10 11 6 14 12 


au ae 


121713 911 
1110 412 6 
6 13 13 12 6 


19 10 6 12 "2 


13 14 14 14 11 
10 511 15 14 
16 12 10 14 10 


11 11 13 13 18 


Pupil totals 
19 21 19 15 18 18 24 20 15 18 37 45 39 30 36 


16 — |14 15 14 18 30 

g 14 |33 23 22 40 

2+B:)|17 17142113  |1519 15 19 15  |32 36 29 40 28 
3 


22 16 13 14 18 24 18 15 20 19 4 28 34 37, 
46 3 


significance of the effects involving occasions (C), since all the expectations
for these effects involve only $\sigma^2_{PC.AB}$ plus the component specific to the effect.
Similarly, the pupils mean square (from P, within A × B) is the correct
error term for testing the significance of the three effects based solely on
sets of scores from different pupils, i.e. frame size (A), intelligence (B) and the
A × B interaction. The F ratios in the last section were derived from this basis.
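
The choice of error term can be read off mechanically from the expectations in table 7.4: the correct error term for an effect is the source whose expectation contains every component of the effect's expectation except the component specific to the effect. The sketch below is illustrative only; the set-based encoding of components and the names `EMS` and `error_term` are assumptions made for this example.

```python
# Components of each mean-square expectation (coefficients omitted),
# for A, B and C fixed and P random, as in table 7.4.
EMS = {
    "A": {"PC.AB", "P.AB", "A"},
    "B": {"PC.AB", "P.AB", "B"},
    "C": {"PC.AB", "C"},
    "AxB": {"PC.AB", "P.AB", "AB"},
    "AxC": {"PC.AB", "AC"},
    "BxC": {"PC.AB", "BC"},
    "AxBxC": {"PC.AB", "ABC"},
    "P within AxB": {"PC.AB", "P.AB"},
    "PxC within AxB": {"PC.AB"},
}


def error_term(effect):
    """Source whose expectation equals that of a fixed effect less its own component."""
    own = effect.replace("x", "")       # e.g. "AxB" -> "AB"
    target = EMS[effect] - {own}
    for source, components in EMS.items():
        if source != effect and components == target:
            return source
    return None


for effect in ("A", "B", "AxB", "C", "AxC", "BxC", "AxBxC"):
    print(effect, "->", error_term(effect))
```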


7.3 A double crossing of the nested factor 


In the last experiment the nested factor (pupils) was crossed with one
other factor (occasions). It is possible, however, for the nested factor to
be crossed with two or more effects. The following experiment provides an
illustration of it being crossed with two other factors.
Table 7.5 (continued)


121114 614 
10 6 8 414 
1311 911 9 


10 6 8 11 14 


161512 8 6 
10 10 7 12 14 
13 12. 9 TALL 
9 914 718 


Pupil totals 


41 37 37 24 32 
29 18 20 18 40 
22 17 16 11 14 40 33 29 25 26 


13 8 21 20 20 27 20 37 32 4 
S 


Totals   A1  638     B1  419     C1  679     Overall total = 1,290
         A2  652     B2  403     C2  611
                     B3  468


16 20 15 11 17 
18 8 6 82 


25 17 22,13 15 
11 10 14 10 18 


14 12 16 12 26 


N.B. Each of the twelve cells in C1 contains the scores of the same twenty
pupils in the same order; similarly for the twelve cells in C2. The total for each
cell is shown circled.

Key: A1 - Physical science    A2 - Biological science
     B1 - Knowledge           B2 - Application          B3 - Evaluation
     C1 - School 1            C2 - School 2
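Mean-square expectations with all effects random. Components deleted for the fixedness of A, B and C are shown in square brackets.

P × C, within A × B:   $\sigma^2_{PC.AB}$

P, within A × B:       $\sigma^2_{PC.AB} + c\sigma^2_{P.AB}$

A × B × C:             $\sigma^2_{PC.AB} + p\sigma^2_{ABC}$

B × C:                 $\sigma^2_{PC.AB} + [p\sigma^2_{ABC}] + pa\sigma^2_{BC}$

A × C:                 $\sigma^2_{PC.AB} + [p\sigma^2_{ABC}] + pb\sigma^2_{AC}$

A × B:                 $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + [p\sigma^2_{ABC}] + pc\sigma^2_{AB}$

(Note that AB is involved in the subscript notation P.AB, so the component $c\sigma^2_{P.AB}$ now enters into
the mean-square expectation for the first time since
the second line above.)

C:                     $\sigma^2_{PC.AB} + [p\sigma^2_{ABC}] + [pa\sigma^2_{BC}] + [pb\sigma^2_{AC}] + pab\sigma^2_C$

B:                     $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + [p\sigma^2_{ABC}] + [pa\sigma^2_{BC}] + [pc\sigma^2_{AB}] + pac\sigma^2_B$

(Note that $c\sigma^2_{P.AB}$ appears, as it contains the subscript
B, and is not deleted despite the fixedness of A.)

A:                     $\sigma^2_{PC.AB} + c\sigma^2_{P.AB} + [p\sigma^2_{ABC}] + [pb\sigma^2_{AC}] + [pc\sigma^2_{AB}] + pbc\sigma^2_A$

(Similarly, $c\sigma^2_{P.AB}$ again appears, and is not deleted.)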


Two tests, one of physical science and one of biological science, were
constructed, each being structured into subtests testing (a) knowledge (of
specific facts, terminology, general principles, etc.), (b) the application of
knowledge (in problems, new situations, etc.) and (c) the evaluation and
scientific interpretation [...] was then a four-way experiment, in that each
score could be classified by subject (A), objective (B), school (C) and pupil (P).
[...] comprise that from a 2 × 3 × 2 factorial experiment, subjects (A),
objectives (B) and schools (C) being the three factors. We may begin,
therefore, by separating the sum of squares for between cells into sums for
the main effects A, B and C, for the three first-order interactions A × B,
A × C and B × C, and the one second-order interaction A × B × C between
these effects. The calculation of these sums of squares follows precisely the
pattern described in sections 6.2 and 7.2, and is not repeated here. The
sums of squares are recorded in table 7.6.

Account must then be taken of the nested factor, pupils, and this has to
be done in three distinct ways. First, there is a sum of squares for the
differences between pupils (the differences between the pupil totals recorded
in table 7.5) within each school. Secondly, there is a sum of squares for the
interaction between pupils and subjects (a sum based on the differences
between the B1 + B2 + B3 totals of table 7.5) within each school. Thirdly,
there is a sum of squares for the interaction between pupils and objectives
(a sum based on the differences between the A1 + A2 totals of table 7.5)
within each school. If, finally, these three sums of squares are subtracted
from the within-cells sum of squares, a residual sum (actually the sum of
squares for the second-order interaction pupils × subjects × objectives
within each school) will be obtained. The pattern of the calculation is
set out below.


1. Total sum of squares (sum of 240 terms)
$$(5^2 + 8^2 + \cdots) - \frac{1{,}290^2}{240} = 7{,}878.00 - 6{,}933.75 = 944.25$$

2. Between-cells sum of squares (sum of 12 terms)
$$\left(\frac{110^2}{20} + \frac{108^2}{20} + \cdots + \frac{112^2}{20}\right) - \frac{1{,}290^2}{240} = 7{,}004.10 - 6{,}933.75 = 70.35$$

3. Within-cells sum of squares
$$944.25 - 70.35 = 873.90$$

4. to 10. Partitioning of the between-cells sum of squares into sums of squares
for A, B, C, A × B, A × C, B × C and A × B × C (procedure the same as
that shown in section 6.2).
11. Between-pupils (P) (within C) sum of squares
$$\left(\frac{37^2}{6} + \frac{45^2}{6} + \cdots + \frac{37^2}{6}\right)^{*} - \frac{679^2}{120} + \left(\frac{41^2}{6} + \frac{37^2}{6} + \cdots + \frac{46^2}{6}\right) - \frac{611^2}{120} = 349.65$$
(each bracket a sum of 20 terms)

12. Pupils (P) × subjects (A) (within C) sum of squares
$$\left(\frac{19^2}{3} + \frac{21^2}{3} + \cdots + \frac{19^2}{3} - \frac{679^2}{120}\right) - \left(\frac{37^2}{6} + \frac{45^2}{6} + \cdots + \frac{37^2}{6} - \frac{679^2}{120}\right) - \left(\frac{330^2}{60} + \frac{349^2}{60} - \frac{679^2}{120}\right) + \cdots = 108.78$$
(the first bracket a sum of 40 terms, the second a sum of 20 terms, with a similar set of terms from the B1 + B2 + B3 totals in C2)

13. Pupils (P) × objectives (B) (within C) sum of squares
$$\left(\frac{12^2}{2} + \frac{14^2}{2} + \cdots + \frac{18^2}{2} - \frac{679^2}{120}\right) - \left(\frac{37^2}{6} + \frac{45^2}{6} + \cdots + \frac{37^2}{6} - \frac{679^2}{120}\right) - \left(\frac{218^2}{40} + \frac{212^2}{40} + \frac{249^2}{40} - \frac{679^2}{120}\right) + \cdots = 270.04$$
(the first bracket a sum of 60 terms, the second a sum of 20 terms, with a similar set of terms from the A1 + A2 totals in C2)

14. Residual sum of squares (P × A × B interaction, within C)
$$873.90 - 349.65 - 108.78 - 270.04 = 145.43$$

* The denominator of 6 is necessary, since each pupil total is the sum of six
subtest scores; similarly for the denominators in the following two sums of
squares (12 and 13). The pupils (P) × objectives (B) (within C) sum of squares (13) is derived from the
A1 + A2 totals.
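
The residual in step 14, and its degrees of freedom, can be checked by subtraction. This is a minimal sketch using the figures given in the steps above; the variable names are illustrative assumptions.

```python
within_cells_ss = 873.90
within_school_ss = {"P": 349.65, "PxA": 108.78, "PxB": 270.04}
residual_ss = within_cells_ss - sum(within_school_ss.values())

schools, pupils, subjects, objectives = 2, 20, 2, 3
df = {
    "P": schools * (pupils - 1),
    "PxA": schools * (pupils - 1) * (subjects - 1),
    "PxB": schools * (pupils - 1) * (objectives - 1),
}
n_scores = schools * pupils * subjects * objectives  # 240 scores in all
n_cells = schools * subjects * objectives            # 12 cells
within_cells_df = n_scores - n_cells
residual_df = within_cells_df - sum(df.values())

print(round(residual_ss, 2), residual_df)
```

The residual degrees of freedom obtained by subtraction agree with the direct product 2 × 19 × 1 × 2.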


Table 7.6 Analysis of variance of the data in table 7.5

Source of variation               Sum of squares   Degrees of freedom   Mean square
A                                       0.82               1                0.82
B                                      28.67               2               14.33*
C                                      19.27               1               19.27
A × B                                  14.41               2                7.20*
A × C                                   2.39               1                2.39
B × C                                   1.11               2                0.55
A × B × C                               3.68               2                1.84
Pupils (P), within C                  349.65              38                9.20†
P × A, within C                       108.78              38                2.86
P × B, within C                       270.04              76                3.55†
Residual (P × A × B), within C        145.43              76                1.91
Total                                 944.25             239

* Significant at the 5-per-cent level.
† Significant at the 1-per-cent level.


The analysis of variance is shown in table 7.6. The degrees of freedom
for pupils follow from there being twenty pupils, and hence 19 degrees of
freedom for each of the two schools. Similarly, within each school the
pupils × subjects interaction has 19 × 1 = 19 degrees of freedom, and the
pupils × objectives interaction 19 × 2 = 38 degrees of freedom. The
degrees of freedom for the residual sum also follow in this way, or again
they may be obtained by subtracting the degrees of freedom for the other
three within-schools sources of variation from those for within cells.

The mean squares for the last four lines of table 7.6 (the within-
schools sources of variation) provide the error terms for testing the
significance of the other effects. Thus, both subjects (A) and the subjects × schools
interaction (A × C) are tested against the mean square for pupils × subjects
(P × A), within schools. (An explanation of the choice of error terms is
postponed until the next section.) Both these effects are clearly insignificant,
the F ratios being less than unity. Again, both objectives (B) and the
objectives × schools interaction (B × C) are tested against the mean square
for pupils × objectives (P × B), within schools. (In other words, for each
effect involving either A or B the appropriate error term is provided by the
interaction of the effect with pupils within schools.) We have, therefore,


for B, $F = \frac{14.33}{3.55} = 4.04$, which for 2 and 76 degrees of freedom
is significant at the 5-per-cent level (statistical table 2A);

and for B × C, $F = \frac{0.55}{3.55} < 1$, so that the objectives × schools interaction
is not statistically significant.

The mean square for pupils (P) within schools is the appropriate error term for
testing the significance of the differences between schools (C). We have,
therefore,

for C, $F = \frac{19.27}{9.20} = 2.09$, which for 1 and 38 degrees of freedom
is not significant at the 5-per-cent level (statistical table 2A).

Finally, the subjects × objectives interaction (A × B) and the second-order
subjects × objectives × schools interaction (A × B × C) are tested against the
residual mean square. We have, therefore,

for A × B, $F = \frac{7.20}{1.91} = 3.77$, which for 2 and 76 degrees of freedom
is significant at the 5-per-cent level (statistical table 2A);

and for A × B × C, $F = \frac{1.84}{1.91} < 1$, so that the second-order interaction is
not statistically significant.
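
The rule stated above, that each effect is tested against the interaction of its within-pupil part (A, B or both) with pupils within schools, can be written as a small function. This is a sketch under that reading of the rule; the names `error_term` and `f_ratios` are illustrative assumptions, and the mean squares are the rounded values of table 7.6.

```python
mean_squares = {
    "A": 0.82, "B": 14.33, "C": 19.27,
    "AxB": 7.20, "AxC": 2.39, "BxC": 0.55, "AxBxC": 1.84,
    "P": 9.20, "PxA": 2.86, "PxB": 3.55, "PxAxB": 1.91,
}


def error_term(effect):
    """Pupils crossed with whichever of A and B the effect involves (C is dropped)."""
    within_pupil = [f for f in effect.split("x") if f in ("A", "B")]
    return "x".join(["P"] + within_pupil)


effects = ["A", "B", "C", "AxB", "AxC", "BxC", "AxBxC"]
f_ratios = {e: mean_squares[e] / mean_squares[error_term(e)] for e in effects}
for e in effects:
    print(f"{e:6s} vs {error_term(e):6s} F = {f_ratios[e]:.2f}")
```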
We may note, too, that the significance of each of the first three within-
schools effects could also be tested against the residual mean square,
though these tests are of little practical interest. (Real differences amongst
pupils and of pupil interactions may usually be assumed.) Two of these
effects prove to be significant at the 1-per-cent level, as indicated in
table 7.6.


Table 7.7 Total scores for the A × B cross-classifications in table 7.5

        A1     A2
B1     212    207
B2     208    195
B3     218    250


To interpret the significant subjects x objectives interaction, the total
(or mean) scores in the six cross-classifications have to be compared. These
are set out in table 7.7. We see that although for each of the first two
objectives, knowledge and application, the score in physical science is the
higher, for evaluation it is the score in biology which is the higher.
Assuming that all the subtests have been properly standardized, we may
conclude that in the two schools sampled an understanding of the evaluatory
aspects of science is better in biology than in physics, whereas achievement
in the other aspects of science is better in physics. Moreover, since the
second-order interaction, subjects x objectives x schools, is not significant,
this finding may be held to apply equally to each of the schools. Overall it
is the differences between objectives (not subjects) which command
attention. Thus, the main effect of objectives is significant, as is also the
pupils x objectives interaction within schools.
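
The reversal that produces the subjects x objectives interaction can be seen by differencing the totals of table 7.7. The following is a minimal sketch, not part of the original analysis; the dictionary layout and labels are assumptions, with the first column taken as physical science and the three rows as knowledge, application and evaluation, as the text implies.

```python
# Totals from table 7.7; row and column labels are assumed from the text.
totals = {
    "knowledge":   {"physical science": 212, "biology": 207},
    "application": {"physical science": 208, "biology": 195},
    "evaluation":  {"physical science": 218, "biology": 250},
}

# Difference (physical science minus biology) for each objective.
differences = {
    objective: row["physical science"] - row["biology"]
    for objective, row in totals.items()
}

for objective, diff in differences.items():
    leader = "physical science" if diff > 0 else "biology"
    print(f"{objective}: difference = {diff:+d} (higher total: {leader})")
```

A change of sign in the differences from one objective to another is what an interaction of this kind looks like in the raw totals.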


7.4 The model for a double crossing of the nested factor 


The model for the science experiment just described may be written as

$$x_{ijkl} = M + A_i + B_j + C_k + (AB)_{ij} + (AC)_{ik} + (BC)_{jk} + (ABC)_{ijk} + P_{kl} + (PA)_{ikl} + (PB)_{jkl} + (PAB)_{ijkl}$$

Again, the first eight terms are the same as those of the model for the
three-factor experiment of chapter 6 (section 6.3), though A now refers to
subject, B to objective and C to school. $P_{kl}$ is a component common to all
the scores of pupil l in school k. $(PA)_{ikl}$ is a component resulting from the
interaction between subject i and pupil l in school k, and $(PB)_{jkl}$ is similarly
a component resulting from the interaction between objective j and pupil l
in school k. Finally, $(PAB)_{ijkl}$ (the residual term) is a component resulting
from the interaction between subject i, objective j and pupil l in school k.
We note that the subscript l does not occur without the subscript k, since
pupils are nested within schools. In the particular experiment described,
i and k both take on the values 1 and 2, j the values 1, 2 and 3, while l
runs from 1 to 20.

The components analysis is given in table 7.8. Again the dot notation
is used, the dot preceding the factor (schools) within which pupils are
nested. (Thus, $\sigma^2_{P.C}$ denotes the variance of pupils within schools,
$\sigma^2_{PA.C}$ the variance of the interaction between pupils and subjects
within schools, and so on.) The mean-square expectations are derived,
starting from the residual variation ($\sigma^2_{PAB.C}$), as before, i.e. on the
initial assumption that all effects are random, deletions for the fixedness
of effects being subsequently made (section 6.3). Subjects and objectives
must obviously be regarded as fixed effects, and schools is also taken as a
fixed effect. Pupils, on the other hand, is a random effect. The mean-square
expectations would then be derived as shown below.


(Components enclosed in square brackets are those deleted.)

P x A x B, within C:  $\sigma^2_{PAB.C}$

P x B, within C:  $\sigma^2_{PAB.C} + a\sigma^2_{PB.C}$

P x A, within C:  $\sigma^2_{PAB.C} + b\sigma^2_{PA.C}$

P, within C:  $\sigma^2_{PAB.C} + [a\sigma^2_{PB.C}] + [b\sigma^2_{PA.C}] + ab\sigma^2_{P.C}$

A x B x C:  $\sigma^2_{PAB.C} + p\sigma^2_{ABC}$

B x C:  $\sigma^2_{PAB.C} + a\sigma^2_{PB.C} + [p\sigma^2_{ABC}] + pa\sigma^2_{BC}$
    (Note that BC is involved in the subscript of the interaction
    PB.C, so the component $a\sigma^2_{PB.C}$ appears, and is
    not deleted since P is a random effect.)

A x C:  $\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + [p\sigma^2_{ABC}] + pb\sigma^2_{AC}$
    (Similarly, the component $b\sigma^2_{PA.C}$ appears and is not
    deleted.)

A x B:  $\sigma^2_{PAB.C} + [p\sigma^2_{ABC}] + pc\sigma^2_{AB}$

C:  $\sigma^2_{PAB.C} + [a\sigma^2_{PB.C}] + [b\sigma^2_{PA.C}] + ab\sigma^2_{P.C} + [p\sigma^2_{ABC}] + [pa\sigma^2_{BC}] + [pb\sigma^2_{AC}] + pab\sigma^2_{C}$

B:  $\sigma^2_{PAB.C} + a\sigma^2_{PB.C} + [p\sigma^2_{ABC}] + [pa\sigma^2_{BC}] + [pc\sigma^2_{AB}] + pac\sigma^2_{B}$
    (Note that the component $a\sigma^2_{PB.C}$ is retained despite
    C being a fixed effect, since the subscript is not
    'essential', i.e. it comes after the dot.)

A:  $\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + [p\sigma^2_{ABC}] + [pb\sigma^2_{AC}] + [pc\sigma^2_{AB}] + pbc\sigma^2_{A}$
    (Similarly, the component $b\sigma^2_{PA.C}$ is retained.)

With the deletions indicated, the mean-square expectations then become
those shown in table 7.8. It follows, too, that the correct error terms for
testing the significance of the various effects are those employed in the
preceding section. All the F ratios used (p. 146) were, in fact, derived from
the components analysis of table 7.8.



Table 7.8 Components analysis for a four-factor experiment, one factor
being nested and doubly crossed


Source of variation              Degrees of freedom*     Mean-square expectation†
A                                a-1                     $\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + pbc\sigma^2_{A}$
B                                b-1                     $\sigma^2_{PAB.C} + a\sigma^2_{PB.C} + pac\sigma^2_{B}$
C                                c-1                     $\sigma^2_{PAB.C} + ab\sigma^2_{P.C} + pab\sigma^2_{C}$
A x B                            (a-1)(b-1)              $\sigma^2_{PAB.C} + pc\sigma^2_{AB}$
A x C                            (a-1)(c-1)              $\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + pb\sigma^2_{AC}$
B x C                            (b-1)(c-1)              $\sigma^2_{PAB.C} + a\sigma^2_{PB.C} + pa\sigma^2_{BC}$
A x B x C                        (a-1)(b-1)(c-1)         $\sigma^2_{PAB.C} + p\sigma^2_{ABC}$
P in C                           c(p-1)                  $\sigma^2_{PAB.C} + ab\sigma^2_{P.C}$
P x A, within C                  c(p-1)(a-1)             $\sigma^2_{PAB.C} + b\sigma^2_{PA.C}$
P x B, within C                  c(p-1)(b-1)             $\sigma^2_{PAB.C} + a\sigma^2_{PB.C}$
P x A x B, within C (residual)   c(p-1)(a-1)(b-1)        $\sigma^2_{PAB.C}$

* These are written for a levels of A, b levels of B, c levels of C and p levels of
P in each of the c levels of C.
† A, B and C are regarded as fixed effects, and P as a random effect.
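
As a check on the degrees-of-freedom column of table 7.8, the entries can be evaluated for the science experiment (a = 2, b = 3, c = 2, p = 20) and compared with the total of N - 1. This is an illustrative sketch only; the function name `degrees_of_freedom` and the dictionary keys are assumptions made for the example.

```python
def degrees_of_freedom(a, b, c, p):
    """Degrees of freedom for each source of variation in table 7.8."""
    return {
        "A": a - 1,
        "B": b - 1,
        "C": c - 1,
        "A x B": (a - 1) * (b - 1),
        "A x C": (a - 1) * (c - 1),
        "B x C": (b - 1) * (c - 1),
        "A x B x C": (a - 1) * (b - 1) * (c - 1),
        "P in C": c * (p - 1),
        "P x A, within C": c * (p - 1) * (a - 1),
        "P x B, within C": c * (p - 1) * (b - 1),
        "P x A x B, within C": c * (p - 1) * (a - 1) * (b - 1),
    }

# Values for the science experiment described in the text.
a, b, c, p = 2, 3, 2, 20
df = degrees_of_freedom(a, b, c, p)
total_scores = a * b * c * p

for source, value in df.items():
    print(f"{source:22s} {value:4d}")
print("sum of df:", sum(df.values()), " total scores - 1:", total_scores - 1)
```

The values 38 and 76 for the P x A and residual lines are the degrees of freedom quoted for the F tests in the text.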


7.5 Further consideration of error terms 


Notwithstanding the implications of a components analysis, there are
times when the use of another mean square for error is justified. Suppose,
for instance, that in another experiment of similar design to the science
experiment the schools effect may be considered random. (In effect,
we select pupils from a random sample of schools, in which case, of course,
the number of schools would almost certainly be far greater than two.)
This means that some of the deletions of components from the mean-square
expectations set out on pp. 148-9 would not now apply. For
instance, in the mean-square expectation for A the component $pb\sigma^2_{AC}$
would have to be retained, so that the appropriate error term for testing
the significance of subject differences is no longer the mean square for the
pupils x subjects interaction (P x A) within schools. We see, in fact, that
the error term must be the mean square of the subjects x schools
interaction (A x C). This is because

$$F = \frac{\text{Mean square for } A}{\text{Mean square for } A \times C} \quad \text{estimates} \quad \frac{\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + pb\sigma^2_{AC} + pbc\sigma^2_{A}}{\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + pb\sigma^2_{AC}}$$

which is greater than unity only if $\sigma^2_{A}$ is non-zero. Nevertheless, if a
pattern of results such as that of table 7.6 were again obtained, it would
be more prudent to test subject differences (A) against the mean square for
the pupils x subjects interaction (P x A) within schools as before.

This is because the mean square for the subjects x schools interaction
(A x C) is below expectation: it has turned out to be less than that for
the pupils x subjects interaction (P x A) within schools (the ratio
2.39/2.86 being less than unity), despite the fact that

$$F = \frac{\text{Mean square for } A \times C}{\text{Mean square for } P \times A \text{, within } C} \quad \text{estimates} \quad \frac{\sigma^2_{PAB.C} + b\sigma^2_{PA.C} + pb\sigma^2_{AC}}{\sigma^2_{PAB.C} + b\sigma^2_{PA.C}}$$

which cannot be less than unity, even if $\sigma^2_{AC} = 0$.

Significance tests based on the subjects x schools interaction mean square
will then be unduly optimistic, i.e. 'significant' results will be obtained too
frequently. The use of the pupils x subjects interaction within schools mean
square as the error term is then to be preferred.*


* Similar considerations would caution against the use of the mean square
for the objectives x schools interaction (B x C) as an error term when this
turns out to be less than the mean square for the pupils x objectives
interaction (P x B) within schools, and also against the use of the mean
square for the subjects x objectives x schools interaction (A x B x C) when
this is less than the residual mean square (P x A x B) within schools.


A practice which may occasionally be justified, and which provides an
error term other than that dictated by a components analysis, is the
practice known as pooling. This consists of adding together two or more
sums of squares and then dividing by the combined degrees of freedom.
The following illustration from table 7.6 is convenient.

The pupils x subjects interaction (P x A) within schools fails to attain
significance at the 5-per-cent level.* Clearly a non-zero value for the
component $\sigma^2_{PA.C}$ has not been established. If we now make the step (a
questionable one) of assuming that this component is in fact zero, it
follows that the mean square for the pupils x subjects interaction (P x A)
within schools and the residual mean square (P x A x B, within C) both
estimate precisely the same parameter, $\sigma^2_{PAB.C}$. A better estimate of this,
i.e. a more stable estimate in that it is based on a larger number of degrees
of freedom, would then be forthcoming from combining the two sums of
squares and degrees of freedom as follows:

$$\frac{108.78 + 145.43}{38 + 76} = \frac{254.21}{114} = 2.23$$

This pooled mean square would then replace the former estimate (1.91) as
the error term in testing the significance of, for instance, the subjects x
objectives interaction (A x B).

Pooling in this way is justified only if there is no pupils x subjects
interaction within schools, i.e. if $\sigma^2_{PA.C} = 0$, and it is extremely doubtful
that this is in fact so. First, pupils usually react differently to different
subjects, and a non-zero $\sigma^2_{PA.C}$ would be expected (though if subjects were
closely similar and had been taught in the same way, as is possible with
physical and biological science taught as part of a unified general science
course, a zero $\sigma^2_{PA.C}$ might reasonably be hypothesized). Secondly, the
obtained subjects x pupils interaction only narrowly fails to attain
significance at the 5-per-cent level (and is significant at the 10-per-cent
level). It would have been more reassuring, if we had expected a zero
$\sigma^2_{PA.C}$, if the obtained interaction had provided an F ratio far nearer unity.

Generally, then, we should conclude that the pooling of sums of squares
to secure error terms based on larger numbers of degrees of freedom may
be justified only (a) if there are non-statistical grounds for expecting
the relevant population interactions to be zero (or, at any rate, to be of
negligible importance), and (b) if such grounds are clearly supported by
the obtained F ratios, i.e. they should fail to attain significance at some
predetermined level (such as the 5-per-cent level), and preferably fail by
more than a narrow margin. Confidence that the relevant interactions are
zero, or as near zero as would make little practical difference, should be
reinforced by the statistical evidence before pooling can be advised. A
more detailed discussion of this problem is provided by Binder (1955). A
paper by Paull (1950) might also prove helpful.
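
The pooling arithmetic described in this section (adding the sums of squares and dividing by the combined degrees of freedom) can be sketched as follows. The function name `pooled_mean_square` is an assumption made for illustration; the figures are those quoted from table 7.6 for the P x A interaction within schools and for the residual.

```python
def pooled_mean_square(sums_of_squares, degrees_of_freedom):
    """Pool several sums of squares into one mean square.

    The sums of squares are added together and divided by the
    combined degrees of freedom.
    """
    return sum(sums_of_squares) / sum(degrees_of_freedom)

# P x A within schools (SS 108.78, 38 df) and residual (SS 145.43, 76 df).
pooled = pooled_mean_square([108.78, 145.43], [38, 76])
print(f"pooled mean square = {pooled:.2f} on {38 + 76} degrees of freedom")
```

The code only carries out the arithmetic; whether pooling is justified depends on the non-statistical and statistical considerations (a) and (b) set out above.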
Chapter 8 Latin-square Designs


8.1 Restricting randomization


In chapter 5 designs with randomized blocks were described. The sole
reason for forming blocks (on the basis of such measures as I.Q. and
previous attainment) was that of reducing variability. The variability
within blocks is considerably less than would be the case from
measurements made from persons selected at random. Nevertheless, within
each block randomization plays an unfettered role (see figure 4).

Sometimes a further restriction on randomization might be advantageous.
Thus, in agricultural experiments where the effect of different
treatments (fertilizers) has to be evaluated, the total land available is
divided into plots by an equal number of rows and columns. Different
treatments are then assigned to the plots, so that each treatment occurs
once and only once in each row, and once and only once in each column.
It follows that the number of treatments must be the same as the number
of rows (or columns). It follows, too, that if the rows are thought of as
corresponding to the blocks of a randomized-blocks design, the arrangement
of treatments for the plots within each block can be random provided
that the same treatment is not repeated in any column. With five rows and
five columns, for instance, and with the numerals 1, 2, 3, 4 and 5 referring
to the different treatments, one such arrangement might be as follows:


Columns 
i ii iii iv v 
i 1 2 3 4 5 
ii 4 5 1 3 2 
Rows iii 2 1 4 5 3 
iv 3 4 5 2 1 
v 5 3 2 1 4 
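
The defining property of such an arrangement (each treatment once and only once in every row and every column) is easy to verify mechanically. The sketch below checks the 5 x 5 arrangement just shown; the function name `is_latin_square` is an assumption made for this example.

```python
def is_latin_square(square):
    """Return True if every symbol occurs exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok

# The 5 x 5 arrangement given in the text (rows i to v, columns i to v).
square = [
    [1, 2, 3, 4, 5],
    [4, 5, 1, 3, 2],
    [2, 1, 4, 5, 3],
    [3, 4, 5, 2, 1],
    [5, 3, 2, 1, 4],
]

print(is_latin_square(square))
```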
The essence of this arrangement is that differences in natural soil fertility
are balanced out in two directions simultaneously: differences from row to
row and differences from column to column. The arrangement is known as
a Latin-square design.* Randomization procedures for obtaining Latin
squares are discussed by Fisher and Yates (1963).

* This is because the treatments are often denoted not by numerals but by
Latin letters, A, B, C, etc.


Table 8.1 Mean scores of four groups on four word lists presented in
four forms (Latin-square design)

                              Word lists (A)
                    1           2           3           4        Totals for B
             1   (C1) 10.2   (C2) 10.7   (C3)  8.4   (C4) 11.0       40.3
Groups (B)   2   (C2) 13.5   (C4) 13.1   (C1) 10.8   (C3) 10.1       47.5
             3   (C3) 10.6   (C1)  9.3   (C4)  8.5   (C2) 11.9       40.3
             4   (C4) 10.2   (C3)  8.4   (C2) 12.7   (C1)  7.8       39.1

Totals for A        44.5        41.5        40.4        40.8

Totals for C:   C1 38.1    C2 48.8    C3 37.5    C4 42.8

Overall total = 167.2

Key for forms: C1 - Dictation            C2 - Multiple-choice
               C3 - Incorrect spelling   C4 - Completion


In educational research the different treatments might be the forms in
which a certain skill or trait is tested, the columns might be different tests
and the rows different individuals or groups. An illustration similar to the
research by Nisbett (1939) is appropriate. Different forms of testing
spelling are investigated. Thus, children could be asked to write down
words from dictation, to correct a list of words incorrectly spelt, to choose
the correct spelling from a number of alternatives (multiple-choice form),
or to complete words with a 'framework' of letters supplied (completion
form): four forms in all. Four lists of words thought to be of about equal
difficulty would then be prepared, and four groups of children selected for
testing. Each group would have to spell the words on all the lists, and each
list would be presented in a different form; but the combination of list and
form would be different for each group. The design would be a Latin
square. A possible arrangement would be that shown in table 8.1.

We see that group 1 attempts list 1 from dictation, list 2 from the 
multiple-choice form, list 3 from incorrect spelling and list 4 from the 
completion form; that group 2 attempts list 1 from the multiple-choice 
form, list 2 from the completion form, list 3 from dictation and list 4 from 
incorrect spelling; and so on. Each treatment (C), the form of the test, is
administered to all the groups, and each list of words is also administered
to all the groups. Again, each list of words is administered with all the 
treatments. Differences between both groups and lists of words are 
therefore eliminated from the differences between the treatments. 

With the data of table 8.1, the breakdown of the variation would be as
shown below. An important point is that sets of treatment totals can be
extracted from the table as well as sets of totals for columns (A) and rows
(B). Thus, the total for C1 is obtained as

$$10.2 + 10.8 + 9.3 + 7.8 = 38.1$$

Sums of squares for lists (A), groups (B) and treatments (C) are obtained,
and these are all subtracted from the total sum of squares to give a residual
sum of squares.

1. Total sum of squares (the first part being the sum of 16 terms)

$$= 10.2^2 + 10.7^2 + \dots + 7.8^2 - \frac{167.2 \times 167.2}{16} = 1,792.04 - 1,747.24 = 44.80$$

2. Between-lists (A) sum of squares

$$= \frac{44.5^2}{4} + \frac{41.5^2}{4} + \frac{40.4^2}{4} + \frac{40.8^2}{4} - \frac{167.2^2}{16} = 1,749.82 - 1,747.24 = 2.58$$

3. Between-groups (B) sum of squares

$$= \frac{40.3^2}{4} + \frac{47.5^2}{4} + \frac{40.3^2}{4} + \frac{39.1^2}{4} - \frac{167.2^2}{16} = 1,758.31 - 1,747.24 = 11.07$$

4. Between-treatments (C) sum of squares

$$= \frac{38.1^2}{4} + \frac{48.8^2}{4} + \frac{37.5^2}{4} + \frac{42.8^2}{4} - \frac{167.2^2}{16} = 1,767.78 - 1,747.24 = 20.54$$

5. Residual sum of squares

$$= 44.80 - 2.58 - 11.07 - 20.54 = 10.61$$

The sums for all three of A, B and C have 3 degrees of freedom (each
effect having four levels) and the total sum has 15 degrees of freedom
(being based on the 16 scores). The residual sum therefore has 15 - 9 = 6
degrees of freedom. In general with an a x a square each of the three effects
would have (a - 1) degrees of freedom, the degrees of freedom for the
residual sum being

$$a^2 - 1 - (a-1) - (a-1) - (a-1) = a^2 - 3a + 2 = (a-1)(a-2)$$

The analysis of variance is accordingly set out as in table 8.2. If the
residual mean square could be accepted as a valid estimate of error, the
treatment differences (C) would be tested by F = 6.84/1.77 = 3.86, which for
3 and 6 degrees of freedom is not significant at the 5-per-cent level
(statistical table 2A). The A and B differences could also be tested in this
way if they happened to be of experimental interest. In this experiment they
are not.

Table 8.2 Analysis of variance of the data in table 8.1


Source of       Sum of       Degrees
variation       squares      of freedom      Mean square
A                 2.58           3
B                11.07           3
C                20.54           3              6.84
Residual         10.61           6              1.77
Total            44.80          15


The analysis is one based upon group means, and is therefore precisely 
the same as if the rows represented different individuals. If, however,
scores of the individual children of each group were considered, the total
sum of squares would first be resolved into sums for between cells and 
within cells (see figure 5), the sums for lists, groups, treatments and 
residual then appearing as components of the former. The within-cells 
sum would also be resolved into separate components if the scores of the 
same group of persons appear in all the cells of any one row. The complete 
analysis is developed in section 8.2. 

We may note, too, that the rows of the square differ not only by each 
having a different group but also because the order in which the treatments 
are presented is different. (It is assumed that all groups take the word lists 
in the order 1 2 3 4, the columns of the square.) Thus, group 1 takes
the test forms in the order 1 2 3 4, group 2 in the order 2 4 1 3, and so
on. Therefore, even if the groups happen to be exactly equal in spelling
ability, a non-zero row effect must be expected. Again, whatever the actual
group differences in spelling ability, differences arising from the order in
which the test forms are taken will also be involved. In other words, the
effect of order is confounded (or mixed in; see footnote p. 128) with
that of groups. It is also confounded with that of word lists. A way of
separating out the effect of order is described in section 8.5.


8.2 The model 


We shall first describe the model for the Latin square when each cell
consists of n scores, all the scores from any one cell being independent of
those from any other. The model when there is only one score per cell (as
is the case for the experiment described in section 8.1) then follows from
putting n = 1. With the same notation as that used in previous designs,
the model may be set down as

$$x_{ijkl} = M + A_i + B_j + C_k + R_{ijk} + e_{ijkl}$$

where M is a component common to all the scores;
      $A_i$ is a component common to all scores in level i of factor A, i.e.
      all scores of column i in an arrangement of data like that of
      table 8.1;
      $B_j$ is a component common to all scores in level j of factor B, or
      all scores in row j;
      $C_k$ is a component common to all scores in level k of factor C, or
      all scores in treatment k;
      $R_{ijk}$ is a component common to all scores in column i, row j and
      treatment k;
and   $e_{ijkl}$ is a component specific to the score of person l in column i,
      row j and treatment k.

Generally all three of i, j and k run from 1 to a (in the experiment described
a = 4). The six contributions to the score $x_{ijkl}$ are all independent of each
other (and so, in particular, person l of any one cell is not the same
person as person l of any other) and the As, Bs, Cs, Rs and es are regarded
as being drawn from normally distributed populations with means of zero,
and variances of $\sigma^2_A$, $\sigma^2_B$, $\sigma^2_C$, $\sigma^2_R$ and $\sigma^2$ respectively.

The components analysis is then as shown in table 8.3. We see that the
residual (R) mean square provides the error term for testing the significance
of each of the three main effects. The B effect could not be designated
'groups' as in the experiment of section 8.1 because the analysis
presupposes separate, and independently selected, groups for each cell. B
would have to be a characteristic common to all scores in each row, such
as when the scores come from persons of the same level of motivation or
I.Q.


Table 8.3 Components analysis for a Latin-square
experiment, with independent groups in each cell


Source of        Degrees of        Mean-square
variation        freedom*          expectation
A                a-1               $\sigma^2 + n\sigma^2_R + an\sigma^2_A$
B                a-1               $\sigma^2 + n\sigma^2_R + an\sigma^2_B$
C                a-1               $\sigma^2 + n\sigma^2_R + an\sigma^2_C$
R                (a-1)(a-2)        $\sigma^2 + n\sigma^2_R$
Within cells     $a^2(n-1)$        $\sigma^2$

* These are written for an a x a square with n scores in each cell.

If, as is often the case in practice, the scores from cells in the same row
are not independent and are all obtained by the same group of n persons,
the above model is inadequate. An additional component representing an
element common to all scores obtained by the same person must appear.
The model becomes

$$x_{ijkl} = M + A_i + B_j + C_k + R_{ijk} + P_{jl} + e_{ijkl}$$

where six of the terms have the same meaning as before, and where the
additional term $P_{jl}$ is a component common to all the scores of person l in
row j. This component is regarded as coming from a normally distributed
population with a mean of zero and a variance of $\sigma^2_P$. In accordance with
the dot notation adopted in chapter 7, this variance will be represented by
$\sigma^2_{P.B}$, showing that the persons (P) are nested within rows (B). $\sigma^2_{P.B}$ can, of
course, be estimated independently of the other effects in precisely the
same way as, for instance, the effect of pupils was estimated from the data
of table 7.5 (see pp. 140-1). The components analysis is then as shown in
table 8.4. The source of variance designated 'remainder' could also be
described as the P x A interaction within B. Again we see that it is the
residual mean square which provides the error term for testing the
significance of the A and C effects. No test for the significance of the B
effect is evident, but this would seldom be of interest anyhow. Usually the
persons tested would be divided into groups (the rows of the square) at
random, so that there would be no real row differences (i.e. $\sigma^2_B = 0$).


Table 8.4 Components analysis for a Latin-square experiment, the same
group being used for all cells in a row


                           Degrees            Mean-square
Source of variation        of freedom*        expectation
A                          a-1                $\sigma^2 + n\sigma^2_R + an\sigma^2_A$
B (rows)                   a-1                $\sigma^2 + a\sigma^2_{P.B} + n\sigma^2_R + an\sigma^2_B$
C                          a-1                $\sigma^2 + n\sigma^2_R + an\sigma^2_C$
R                          (a-1)(a-2)         $\sigma^2 + n\sigma^2_R$
Persons P, within B        a(n-1)             $\sigma^2 + a\sigma^2_{P.B}$
Remainder                  a(a-1)(n-1)        $\sigma^2$

* These are written for an a x a square where persons are nested within
rows (n in each row).
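
The degrees of freedom in tables 8.3 and 8.4 can be checked against the total of a²n - 1 for a square with a²n scores. The sketch below does this for a few sizes; the two function names are assumptions made for the example, and the dictionary keys simply follow the row labels of the tables.

```python
def df_independent_groups(a, n):
    """Degrees of freedom of table 8.3 (independent groups in each cell)."""
    return {
        "A": a - 1,
        "B": a - 1,
        "C": a - 1,
        "R": (a - 1) * (a - 2),
        "Within cells": a * a * (n - 1),
    }

def df_same_group_per_row(a, n):
    """Degrees of freedom of table 8.4 (same n persons throughout a row)."""
    return {
        "A": a - 1,
        "B (rows)": a - 1,
        "C": a - 1,
        "R": (a - 1) * (a - 2),
        "Persons P, within B": a * (n - 1),
        "Remainder": a * (a - 1) * (n - 1),
    }

for a, n in [(4, 1), (4, 10), (3, 5)]:
    total = a * a * n - 1
    t83 = df_independent_groups(a, n)
    t84 = df_same_group_per_row(a, n)
    print(a, n, sum(t83.values()) == total, sum(t84.values()) == total)
```

With n = 1 the within-cells, persons and remainder lines all have zero degrees of freedom, which is the collapse of the two tables noted in the text.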


Note that if n = 1, the last line of table 8.3 and the last two lines of
table 8.4 disappear (since the degrees of freedom become zero). The two
tables become identical, in fact, as $\sigma^2$ and (in table 8.4) $\sigma^2_{P.B}$ must then
be removed from all the mean-square expectations. $\sigma^2_R$ becomes the
mean-square expectation of the residual and provides the error term for
testing the A, B and C effects. We have already seen that the residual mean
square is the appropriate error term for testing the A and C effects when
groups not individuals constitute the rows (i.e. n > 1). The degrees of
freedom on which the residual mean square is based, (a-1)(a-2), is
therefore of special interest.

With a 4 x 4 square the number of degrees of freedom for the residual
is 3 x 2 = 6, and with a 3 x 3 square it is only 2 x 1 = 2. Obviously
considerable instability must be expected in the residual mean square for
Latin squares as small as these. As large squares are seldom practicable in
educational research (since each addition to the row and column size
must be matched by an additional treatment) the advantage of using, if
possible, an error term based on a larger number of degrees of freedom is
apparent. One way of increasing the number of degrees of freedom for
error would be to use not a single square but several squares in
combination. A second way (not dissimilar from the first) would be to use
groups, not individuals, in the rows, and so secure a within-cells
(table 8.3) or a remainder (table 8.4) mean square.

We see from the components analyses, however, that the A and C
effects can be tested for significance by the within-cells or remainder mean
square only if $\sigma^2_R = 0$, i.e. only if there are no distinctive components for
cells. In certain circumstances this assumption might not be unreasonable.
These circumstances, however, must include an allocation of the treatments
to the cells of the square at random, subject only to the requirements of the
design (i.e. no treatment appearing twice in the same row or column). In
particular, the treatments must not be allocated to the cells in any
prescribed order. Latin squares in which this has been done are termed
systematic squares. Examples of systematic squares are the diagonal square,
which for five treatments would be written as

[5 x 5 diagonal square]

and the Knut Vik square

[5 x 5 Knut Vik square]

If squares such as these were deliberately chosen, the within-cells or
(when the same groups are used throughout the rows) remainder mean
square would not provide a valid estimate of error. Furthermore, even
when individuals and not groups are represented by the rows of the square,
a deliberate choice of a systematic square is inadvisable. This is because
the R components of the model would not then constitute random errors.
A discussion of this is provided by Fisher (1951).
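
One simple way of avoiding a deliberately systematic square is to start from a systematic (cyclic) square and then permute its rows, its columns and its treatment labels at random. The sketch below illustrates this idea only; it is an assumption of this example, not the procedure tabulated by Fisher and Yates, and it is not claimed to select with equal probability from all possible Latin squares of a given size. The function name `random_latin_square` is hypothetical.

```python
import random

def random_latin_square(a, rng=None):
    """Return an a x a Latin square obtained by randomly permuting the rows,
    columns and treatment labels of a cyclic (systematic) square."""
    rng = rng or random.Random()
    # Systematic starting square: entry (i, j) is (i + j) mod a.
    base = [[(i + j) % a for j in range(a)] for i in range(a)]

    rows = list(range(a))
    cols = list(range(a))
    labels = list(range(1, a + 1))
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(labels)

    # Reorder rows and columns, then relabel the treatments.
    return [[labels[base[r][c]] for c in cols] for r in rows]

square = random_latin_square(5, random.Random(1))
for row in square:
    print(" ".join(str(t) for t in row))
```

Permuting rows, columns and labels preserves the Latin-square property, so every square produced still has each treatment once in each row and column.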


8.3 Comparison with the factorial design


A practical advantage the Latin square enjoys over the factorial design is
that far fewer combinations of factor levels are needed. Thus, with three
factors each of four levels (which corresponds to the experiment described
in section 8.1) the factorial design would require 4 x 4 x 4 = 64 different
combinations for testing, as against only sixteen for the Latin square. For
the particular experiment of section 8.1 a factorial design would not be
appropriate. It would require combining each word list with each form of
testing with each group, which would mean that each group would be
asked to spell the same words in four different forms of testing. This
arrangement would undermine the experiment. A person's ability to spell
certain words would inevitably be affected by being required to spell the
same words before (in a different form of test). For many experiments, on
the other hand, a factorial design would be practicable, provided, of
course, that the number of levels of all three factors is the same, or could
be made the same. The purpose would then be, not that of removing
extraneous influences, but of evaluating the effect of factors of intrinsic
interest. It is important to realize, however, that the advantage of the Latin
square can be bought at too high a price.

The price to be paid can best be realized from comparing the model for
the Latin square (pp. 157-8) with that for the three-factor experiment in
chapter 6 (p. 117). We see at once that, compared with the model for the
factorial design, the Latin square has all the interaction components
missing. In other words, the Latin-square design assumes that all the
interactions are zero. This assumption is necessary if only because there
are simply not enough degrees of freedom available for separating out all
the interactions that may exist. If any interactions do in fact exist, they are
confounded with the main effects. Also, the residual mean square would
not be a suitable error term for testing the significance of the main effects.


Let us suppose that in the experiment described in section 8.1 an
appreciable B x C interaction exists, and that the interaction components
are as shown below. (These correspond to the fourth entry in each of the
cells of table 5.4. We note that they sum to zero in every row and column.)

             C1     C2     C3     C4
      B1      5      0     -2     -3
      B2     -2      7     -4     -1
      B3      4     -2     -1     -1
      B4     -7     -5      7      5

The Latin square for the experiment is as follows (see table 8.1).

             A1     A2     A3     A4
      B1     C1     C2     C3     C4
      B2     C2     C4     C1     C3
      B3     C3     C1     C4     C2
      B4     C4     C3     C2     C1

Consider the A effect, that based solely on the differences among the
column sums. Each column sum is the sum of four cells as set out below.

      Column 1      Column 2      Column 3      Column 4
       A1B1C1        A2B1C2        A3B1C3        A4B1C4
      +A1B2C2       +A2B2C4       +A3B2C1       +A4B2C3
      +A1B3C3       +A2B3C1       +A3B3C4       +A4B3C2
      +A1B4C4       +A2B4C3       +A3B4C2       +A4B4C1

The sums are obviously balanced with respect to both the B and the C
effects separately, since each of the Bs and Cs enter once, and only once,
into each sum. In other words, each of the B and C effects acts
independently of A. The sums are not balanced, however, with respect to the
interaction of B and C; this is evident if we write in the possible interaction
components specified above, as follows (each component being shown in
brackets after its cell):

      Column 1         Column 2         Column 3         Column 4
       A1B1C1 (5)       A2B1C2 (0)       A3B1C3 (-2)      A4B1C4 (-3)
      +A1B2C2 (7)      +A2B2C4 (-1)     +A3B2C1 (-2)     +A4B2C3 (-4)
      +A1B3C3 (-1)     +A2B3C1 (4)      +A3B3C4 (-1)     +A4B3C2 (-2)
      +A1B4C4 (5)      +A2B4C3 (7)      +A3B4C2 (-5)     +A4B4C1 (-7)

       5+7-1+5          0-1+4+7          -2-2-1-5         -3-4-2-7
       = 16             = 10             = -10            = -16

It follows, therefore, that with a B x C interaction considerable
differences can exist among the column sums, even if A1, A2, A3 and A4 do
not differ in their effects at each of the levels of B or C separately. The
B x C interaction, in other words, is confounded with the A effect.

In the same way it can be shown that the A x C interaction is confounded
with the B effect, and that the A x B interaction is confounded with
the C effect. It can be shown, too, that the second-order interaction
A x B x C is confounded with each of the A, B and C effects. If zero
interactions are not assumed, the mean-square expectations for the
Latin-square design will in fact be as given in table 8.5. (See Wilk and
Kempthorne 1957 for a detailed discussion.) These are written for a mixed
model with the rows effect (B) random and the columns (A) and treatment
(C) effects fixed. If the rows effect is also fixed, then the component $\sigma^2_{AB}$
must be deleted from the mean-square expectation of A, and the component
$\sigma^2_{BC}$ deleted from the mean-square expectation of C. Obviously if all these
interactions exist, no valid tests of significance are possible.
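
The confounding of the B x C interaction with the A effect can be reproduced numerically. The sketch below assumes the B x C interaction components and the allocation of forms read from the display of column sums in the text (two of the entries, for B2C1 and B4C1, are inferred from the printed column sums and from the requirement that rows and columns sum to zero); the variable names are illustrative.

```python
# B x C interaction components: rows B1..B4, columns C1..C4 (assumed values,
# consistent with the column sums 16, 10, -10 and -16 quoted in the text).
interaction = [
    [5, 0, -2, -3],
    [-2, 7, -4, -1],
    [4, -2, -1, -1],
    [-7, -5, 7, 5],
]

# Form (C) given to group (row B) on word list (column A), as in table 8.1.
forms = [
    [1, 2, 3, 4],
    [2, 4, 1, 3],
    [3, 1, 4, 2],
    [4, 3, 2, 1],
]

# Each interaction row and column sums to zero, so B and C main effects
# are unaffected by these components.
row_sums = [sum(row) for row in interaction]
col_sums = [sum(col) for col in zip(*interaction)]

# Contribution of the B x C components to each column (A) sum of the square.
column_sums = [
    sum(interaction[b][forms[b][col] - 1] for b in range(4))
    for col in range(4)
]

print("interaction row sums:   ", row_sums)
print("interaction column sums:", col_sums)
print("contribution to A sums: ", column_sums)
```

Although the components cancel within every level of B and of C, they do not cancel within the columns of the square, so they appear as spurious differences between the levels of A.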

Latin squares are used extensively in agricultural research, since there
interactions are not a serious issue. In field experiments, for instance, a
row by column interaction would arise if there was a fertility gradient not
parallel to the sides of the square. Provided that reasonably large squares
(at least 5 x 5 squares, say) are used, where the distorting effects of any
interactions would be less than with smaller squares, and systematic
squares are avoided (except in so far as these might arise occasionally
from a random arrangement of the treatments), no serious objection to the
use of Latin squares could be sustained. In educational research, however,
the situation is very different. First, large squares are seldom used.
Secondly, we should note that, whereas the agricultural worker effects a
double (row and column) control over a single extraneous influence (soil
variability), the educational researcher seeks to control two essentially
different influences. The interactions may then be much more marked.
Only if the investigator has grounds for believing the interactions to be
negligible should the Latin-square design be used.


Table 8.5 Components analysis for a Latin-square experiment with
independent groups in each cell when zero interactions are not assumed


Source of        Degrees of        Mean-square
variation        freedom*          expectation

2 2 2 2 2 2 A 

A a—1 o0 +noR+|1—=) onc 04nd- Onc +a 
d, 
1 

B a-1 c? - noi 4- ( -; ornctorctaas 

a 

C 2 2 2 2 2 2 e 

a-1 0 nog |1—- | cánc-- Gan Oc 40€ 
a 
2 

R (a—1)(a—2) eene (1 ) risen abet ote 
a 


Within cells     $a^2(n-1)$        $\sigma^2$
* These are written for an a x a square with n scores in each cell.


In the experiment on spelling, for instance, the use of the Latin square
would be justified only if we could be sure that no appreciable interaction
exists between the word lists and the test forms, between the groups and
the word lists, and between the groups and the test forms. It would be
preferable, too, for any investigator's 'hunch' that all these interactions are
negligible to be supported by some empirical evidence. If such evidence is
not available, the factorial design, if practicable, is to be preferred.


8.4 The 2 × 2 square 


The 2 × 2 square stands on its own in so far as the residual effect disappears. This is because there are no degrees of freedom available. We have seen that in an a × a square the degrees of freedom for the residual sum of squares are (a−1)(a−2), and when a = 2 this becomes zero. This links up with the fact that only two arrangements of treatments are possible, i.e. 

    1 2         2 1
    2 1   and   1 2

and if one of these arrangements yields a given treatment sum of squares, the second arrangement will then yield the same treatment sum of squares, too. The 3 degrees of freedom between cells are fully accounted for by the single degree of freedom for each of the rows, columns and treatments. 

The 2 × 2 square is, nevertheless, useful when it is groups and not individuals who constitute the rows. (Alternatively, a series of squares with individuals constituting rows could be used.) A frequent use of the square in educational research is that of determining a difference in difficulty between two tests, and more especially between two parallel forms of the same test. One group of children would take form 1 followed by form 2, and a second group form 2 followed by form 1. The effect of practice (the increase in score to be expected on the form taken second) would then be removed in a comparison of the forms from the two groups combined. We might, of course, be interested in the practice effect also. This would be obtained from a comparison of the total score on the forms taken second (forms 1 and 2 combined) with that on the forms taken first. Since the order of the test forms changes for the second group, this design is referred to as a change-over (also a cross-over) design. 

Table 8.6 Scores of two groups on two test forms presented on two 
occasions (change-over design) 

                        Occasions
                        1            2            Totals
Group 1 (cell totals)   153 (I)      170 (II)     323
Group 2 (cell totals)   156 (II)     160 (I)      316
Totals                  309          330          639

Totals of forms:   I  313     II  326

[The individual scores and person totals are not legible in this copy; the four cell totals follow from the marginal totals.]

N.B. (a) The two scores in each row belong to the same person. 
     (b) The form taken in each cell is shown in brackets after the cell total. 

Key: I = Form 1 of test; II = Form 2 of test 

Suppose that with five persons in each group (in an actual experiment the number would be considerably larger) the test scores are as shown in table 8.6. The analysis would, in the first place, partition the total sum of squares into sums for between cells and within cells. The former sum would then be divided into sums of squares for occasions of testing (columns), groups (rows) and treatments (test forms 1 and 2). Since within each group the scores in the two cells are not independent, a pair of scores (one on each occasion) being derived from the same person, a sum of squares for between persons within groups must be extracted from the within-cells sum (see table 8.4). The remaining part of the within-cells sum is the persons × occasions interaction within groups. The calculation is as shown below. 


1. Total sum of squares $= (33^2 + 25^2 + \dots + 34^2) - \frac{639 \times 639}{20}$   (sum of 20 terms)
   $= 20{,}681 - 20{,}416.05$
   $= 264.95$

2. Between-cells sum of squares $= \frac{153^2 + 170^2 + 156^2 + 160^2}{5} - \frac{639^2}{20}$
   $= 20{,}449 - 20{,}416.05$
   $= 32.95$

3. Within-cells sum of squares $= 264.95 - 32.95$
   $= 232.00$

4. Between-occasions (A) sum of squares $= \frac{309^2}{10} + \frac{330^2}{10} - \frac{639^2}{20}$
   $= 20{,}438.10 - 20{,}416.05$
   $= 22.05$

5. Between-groups (B) sum of squares $= \frac{323^2}{10} + \frac{316^2}{10} - \frac{639^2}{20}$
   $= 20{,}418.50 - 20{,}416.05$
   $= 2.45$

6. Between-forms (C) sum of squares $= \frac{313^2}{10} + \frac{326^2}{10} - \frac{639^2}{20}$
   $= 20{,}424.50 - 20{,}416.05$
   $= 8.45$

Note that these last three sums combine to give the between-cells sum. 

7. Between persons (P), within groups, sum of squares $= \left(\sum \frac{P^2}{2} - \frac{323^2}{10}\right) + \left(\sum \frac{P^2}{2} - \frac{316^2}{10}\right)$
   $= 208.00$
   where each sum is taken over the 5 person totals P of one group.

8. Remainder sum of squares (P × A interaction, within B) $= 232.00 - 208.00$
   $= 24.00$

The between-persons sum has 8 degrees of freedom (4 from each group), and the residual sum also has 8 degrees of freedom (4 × 1 = 4 from each group), giving together the 16 degrees of freedom for within cells. All the sums of squares and degrees of freedom are set out in table 8.7. The effects of both occasions (A) and test forms (C) are tested for significance against the residual mean square, since differences between the occasions and between the two test forms are based on scores from the same persons (see also the components analysis, table 8.4). For occasions, therefore, we have 

$F = \frac{22.05}{3.00} = 7.35$

which, for 1 and 8 degrees of freedom, is significant at the 5-per-cent level (statistical table 2A). For test forms we have 

$F = \frac{8.45}{3.00} = 2.82$

which, for 1 and 8 degrees of freedom, is not significant at the 5-per-cent level. Of course, the obtained difference in mean score, though not significant, might still be of practical importance. 
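The two variance ratios can be recomputed from the sums of squares and degrees of freedom of table 8.7. This is a small sketch under stated assumptions: the helper `f_ratio` is a hypothetical name, and the critical value 5.32 is the 5-per-cent point of F for 1 and 8 degrees of freedom taken from standard tables, assumed to agree with statistical table 2A.

```python
RESIDUAL_MS = 24.00 / 8   # P x A within B: 3.00 on 8 degrees of freedom
CRITICAL_F_1_8 = 5.32     # 5-per-cent point of F(1, 8) from standard tables


def f_ratio(sum_of_squares, df, error_mean_square):
    """Variance ratio: effect mean square divided by the error mean square."""
    return (sum_of_squares / df) / error_mean_square


f_occasions = f_ratio(22.05, 1, RESIDUAL_MS)  # A
f_forms = f_ratio(8.45, 1, RESIDUAL_MS)       # C

for label, f in (("occasions (A)", f_occasions), ("test forms (C)", f_forms)):
    verdict = "significant" if f > CRITICAL_F_1_8 else "not significant"
    print(f"{label}: F = {f:.2f} ({verdict} at the 5-per-cent level)")
```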

Table 8.7 Analysis of variance of the data in table 8.6 

Source of variation       Sum of squares   Degrees of freedom   Mean square
A                              22.05               1               22.05
B                               2.45               1
C                               8.45               1                8.45
Persons (P), within B         208.00               8
P × A, within B                24.00               8                3.00
Total                         264.95              19


We would usually have no interest in the group differences themselves. In experiments of this kind the persons tested are separated into two groups at random, in which case no real group difference exists, i.e. the component $\sigma_B^2$ of table 8.4 is zero. The component $\sigma_R^2$ shown in table 8.4 is also zero, since the residual source of variation does not now exist. Lines 2 and 4 of table 8.7 could then be combined to give a single source of variation for persons. 


Source of variation       Sum of squares   Degrees of freedom
B                               2.45               1
Persons (P), within B         208.00               8
P                             210.45               9


8.5 Extensions of the Latin square 


The principle of restricting the randomization of treatments within a Latin square can be carried one or more stages further. One additional restriction upon the randomization might, for instance, be imposed. This would result in three, not two, distinct sources of variation being equalized. The number of classifications of the additional source (or the number of levels of the new factor) must be the same as the number of treatments, i.e. the number of rows (or columns) in the square. Each treatment would then occur once and only once with each of the additional classifications, as well as once and only once in each row and column of the square. Such an arrangement is termed a Graeco-Latin square.* 

In the 4 × 4 square of section 8.1 the treatments were denoted by $C_1$, $C_2$, $C_3$ and $C_4$. If we then use the letter D for the additional classifications, 
an example of a Graeco-Latin square would be 


We note that each of the Cs occurs once and only once with each of the 
Ds, and each of the Ds (as well as the Cs) occurs once and only once in 
each row and column. The square could, in fact, be regarded as the result 
of two Latin squares being superimposed, the squares 


(Squares which can be superimposed in this manner are called ‘orthogonal’ 
Latin squares.) 
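The defining properties of a Graeco-Latin square (each C and each D once in every row and column, and each C paired once with each D) can be verified mechanically. The sketch below is illustrative only: the function names are hypothetical, and the square used is an assumption. Its first two rows follow the description of groups 1 and 2 given in the text, while rows 3 and 4 are just one possible completion and need not be the square printed in the book.

```python
from itertools import product


def is_latin(square):
    """True if every symbol 1..n occurs once in each row and each column."""
    n = len(square)
    symbols = set(range(1, n + 1))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


def is_graeco_latin(c_square, d_square):
    """True if both squares are Latin and every (C, D) pair occurs exactly once."""
    n = len(c_square)
    pairs = {
        (c_square[r][c], d_square[r][c]) for r, c in product(range(n), repeat=2)
    }
    return is_latin(c_square) and is_latin(d_square) and len(pairs) == n * n


# Subscripts of C (test forms) and D (order); rows are groups, columns word lists.
# Rows 1 and 2 follow the text; rows 3 and 4 are an assumed completion.
C = [[1, 2, 3, 4], [3, 4, 1, 2], [4, 3, 2, 1], [2, 1, 4, 3]]
D = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3], [3, 4, 1, 2]]

print(is_graeco_latin(C, D))
```

Superimposing a Latin square on itself fails the pairing condition, which is why the two component squares must be orthogonal.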
To develop the illustration of the Latin square in section 8.1, in which the columns of the square represent word lists, the rows groups, and the Cs forms of testing, the new classification D could now represent the order in which the different combinations of word lists and test forms are presented. Thus, in the above Graeco-Latin square group 1 (in row 1) would take list 1 in form 1 first, list 2 in form 2 second, list 3 in form 3 third, and list 4 in form 4 fourth. Again, group 2 (in row 2) would take list 4 in form 2 first (since $C_2D_1$ is in column 4 of the square), list 3 in form 1 second (since $C_1D_2$ is in column 3), list 2 in form 4 third, and list 1 in form 3 fourth; and similarly for groups 3 and 4. The result would be an exact balancing of differences of order, as well as those of word lists and groups, in a comparison of test forms, the Cs. Differences of order would also be balanced out in a comparison of word lists and (should the comparison be of any interest) of groups. 

* The additional classifications are often denoted by Greek letters (α, β, γ, etc.), which are then used in combination with Latin letters for the original square. Hence the name Graeco-Latin square. 

We saw that in the 4 × 4 Latin square there were 6 degrees of freedom available for the residual (table 8.2). In the 4 × 4 Graeco-Latin square 3 of these degrees of freedom will be taken over by the additional classification D. Generally for an a × a square the degrees of freedom will separate as follows: 


A (rows)       a−1 
B (columns)    a−1 
C              a−1 
D              a−1 
Residual       (a−1)(a−3) 

Total          a²−1 


For a 4 × 4 square there would be only 3 × 1 = 3 degrees of freedom available for the residual, and for a 3 × 3 square there will be no degrees of freedom available at all. (A 2 × 2 Graeco-Latin square, of course, cannot even exist!) Obviously for small squares it is highly desirable that groups and not individuals be tested in the rows, or else that a combination of several Graeco-Latin squares be used. Again, systematic squares should be avoided as an automatic choice. The components analysis of a Graeco-Latin-square experiment follows the pattern of tables 8.3 and 8.4 with an additional line for the new source of variation D appearing (the mean-square expectation for D being similar to that for A and C), and with (a−1)(a−3) replacing (a−1)(a−2) as the number of degrees of freedom for the residual R. 
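The way the between-cells degrees of freedom divide up for different sizes of square can be tabulated with a few lines of code. This is a minimal sketch; the function name `graeco_latin_df` and the row labels are hypothetical choices made for illustration.

```python
def graeco_latin_df(a):
    """Degrees of freedom between cells for a single a x a Graeco-Latin square."""
    if a < 3:
        raise ValueError("a Graeco-Latin square needs at least 3 treatments")
    return {
        "A (rows)": a - 1,
        "B (columns)": a - 1,
        "C": a - 1,
        "D": a - 1,
        "Residual": (a - 1) * (a - 3),
        "Total": a * a - 1,
    }


for size in (3, 4, 5):
    breakdown = graeco_latin_df(size)
    print(size, breakdown["Residual"], breakdown["Total"])
```

For a = 3 the residual line is empty, which is the reason the text recommends groups in the rows, or several squares, when the square is small.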

Graeco-Latin squares exist for all numbers of treatments from three upwards with the single exception of six; in particular, they exist for all odd numbers of treatments. Examples of Graeco-Latin squares for up to twelve treatments are given by Cochran and Cox (1957). With regard to their usefulness in educational research, it must be borne in mind that they are subject to all the disadvantages of Latin squares. In particular, each main effect would be confounded with the first-order interactions of all the other effects, and also with the higher-order interactions. 

The principle of equalizing an additional source of variation among the 
treatments could, in theory, be carried further. Thus, a fourth source of 
variation E could be simultaneously controlled, given that the number of 
classifications of E is the same as the number of treatments. Again, each 
classification of E would occur once and only once in each row and column 
of the square, and once and only once with each of the C and D classifica- 
tions. Such an arrangement is termed a hyper-Graeco-Latin square. An 
example for a 4 × 4 square is shown below. We may note that the 15 
degrees of freedom for between cells are now fully accounted for by the 3 
degrees of freedom for each of A (rows), B (columns), C, D and E. (It 
follows, then, that if the rows represent individuals, a single 4 × 4 square is 
of little value; a combination of several squares would be needed.) 


The writer knows of no instance where such squares have been fruitfully 
applied in the field of education. 


Chapter 9 Covariance Designs 


9.1 Statistical control 


In the randomized-blocks design described in chapter 5, differences resulting from a factor extraneous to the treatments studied were controlled. Blocks could be formed on the basis of, say, intelligence, so that all the persons in any one block were within the same (narrow) intelligence range. The results were then affected to but a small extent by differences in intelligence. 

A practical disadvantage in controlling intelligence experimentally in this way is that the score of each person on a suitable test must be known beforehand. Again, many of the persons available for the experiment might not have scores or I.Q.s falling conveniently into the blocks, so that some of the information available (or potentially available) would be sacrificed. Sometimes, too, it might not be advisable or even possible to control an extraneous factor by forming blocks at the beginning of the experiment. Such a situation would arise, for instance, in an investigation into the effectiveness of different methods of studying. The length of time a person spends studying could not realistically be fixed beforehand. This leads us to the possibility of controlling differences in the extraneous factor (intelligence, length of time studying, etc.) in another way. One such way is provided by the analysis of covariance. The control is exercised statistically rather than experimentally. 

In the methods experiment, for example, with intelligence as the extraneous factor, we would begin by considering the connection between final attainment and intelligence within each method group. In particular, the regression of attainment on intelligence would be determined. This regression (essentially a measure of average increase in attainment for an increase in intelligence) would differ from group to group. But it might well be that these differences in regression would be small, or at any rate statistically insignificant. If so, one population regression could be assumed for all the groups. This would be estimated from an average of all the separate within-groups regressions, and would then be applied to the differences in mean scores (attainment and intelligence mean scores) of the groups. 

If, for instance, one group were superior in intelligence—its intelligence 
mean score being above the overall intelligence mean—the extent of the 
corresponding superiority expected in attainment could be determined. 
This expected superiority would depend solely on the group’s intelligence 
mean expressed as a deviation from the overall mean, and the regression 
of attainment on intelligence. If the extent of this expected superiority were 
then subtracted from the group’s actual attainment mean, the attainment 
mean would be adjusted for the difference in intelligence. Again, if another 
group were inferior in intelligence, the extent of the corresponding 
inferiority expected in its attainment would be determined. This amount 
would then be added to the group’s actual attainment mean to give a mean 
adjusted for the difference in intelligence. 
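The adjustment just described can be stated compactly in symbols. The notation is introduced here for illustration only and is an assumption, not the text's own: $\bar{Y}_i$ and $\bar{X}_i$ are the attainment and intelligence means of group $i$, $\bar{X}$ is the overall intelligence mean, and $b$ is the common regression of attainment on intelligence.

```latex
\bar{Y}_i^{\,\mathrm{adj}} \;=\; \bar{Y}_i \;-\; b\,\bigl(\bar{X}_i - \bar{X}\bigr)
```

For a group above the overall intelligence mean the bracket is positive and the product is subtracted; for a group below it the bracket is negative, so the same formula adds the corresponding amount.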

When the actual attainments of all the groups have been adjusted in this way, it may be found that the rank order of adjusted means differs considerably from that of the obtained means. For instance, a group with a markedly inferior average intelligence would merit a substantial 'adjustment', and with only an average actual attainment would end up with a high adjusted attainment. Similarly, a group with high intelligence but only an average attainment would necessarily have an adjusted attainment well below average. 

Another possibility is that, whereas the actual group attainments might differ significantly, the adjusted attainments would not. We should note, too, that it is the differences among the adjusted attainments which are the more important, for these are the differences that would result if all the groups had been equated for intelligence in the first place. If the adjusted means were found to differ significantly, we could be sure that this was not a consequence of the group differences in intelligence, for these differences (even if they happened to be considerable) have been controlled statistically. Because of this control, we could arrive at much the same result as if the groups had been matched for intelligence in the first place. 


9.2 An analysis of covariance 


The detailed procedure in an analysis of covariance can be illustrated by an investigation similar to one conducted by Jones et al. (1957) into certain educational aspects of English-Welsh bilingualism. Children brought up in Welsh-speaking homes might, it is thought, be less fluent in English even at the end of their primary-school education (the medium of school instruction being English) than monoglot English-speaking children from homes where little or no Welsh is heard. For the purpose of the investigation, children are divided into categories of linguistic background, one possible division being (1) a category of children where Welsh is habitually spoken at home by both parents; (2) a category of children where Welsh is spoken only occasionally at home; and (3) a category of children where Welsh is never spoken at home. Random groups of ten-year-old children are then selected from each category, and each group is given (a) a test of non-verbal intelligence, and (b) a test in the usage of English. Table 9.1 shows sets of test scores for ten children in each group. (In an actual investigation the group numbers would be far larger.) X refers to non-verbal intelligence and Y to English. We see that the group means in both tests differ considerably. The aim is to adjust the differences in English for the differences in intelligence. 


Table 9.1 Scores of three groups on two tests (for analysis of covariance) 

              Group 1            Group 2            Group 3
          Test X   Test Y    Test X   Test Y    Test X   Test Y
            34       37        39       37        42       41
            38       34        41       42        46       39
            40       37        44       40        48       47
            43       39        48       40        50       42
            46       37        50       46        52       46
            48       41        52       43        53       44
            50       42        54       47        56       48
            51       39        59       45        59       45
            55       40        60       53        60       52
            56       46        65       50        64       48

Sum        461      392       512      443       530      452
Mean       46.1     39.2      51.2     44.3      53.0     45.2

                 Test X    Test Y
Overall sum       1503      1287
Overall mean      50.1      42.9

We begin by calculating the regression of English (Y) on intelligence (X) for each of the groups separately. To do this, the corrected sum of squares for X and the corrected sum of products XY are needed for each group. (A corrected sum of squares has been explained in section 2.2 as the sum of squares of scores expressed as deviations from their mean; a corrected sum of products is similarly the sum of products of scores, each score being expressed as a deviation from its own group mean.) The calculation of these sums is shown in table 9.2. (The corrected sums of squares for Y have also been calculated, as these will be needed later.) The regression of Y on X is then given by dividing the sum of products by the sum of squares for X, i.e. by $\sum xy / \sum x^2$. This works out at 0.357 for group 1, 0.493 for group 2 and 0.417 for group 3. 

Each regression shows the overall extent to which the Y scores increase with X. Consider, for instance, the regression 0.357 for group 1. This is shown by the slope or gradient of the oblique line in figure 8. The points representing the X and Y scores of the ten children are also shown. The regression line, which also passes through the point representing the X and Y mean scores of the group, is such that the vertical (Y) deviations of the points from it are a minimum, i.e. the sum of the deviations above the line equals the sum of the deviations below the line, and the sum of the squares of all the deviations is less than would be the case for any other straight line. The line is 'best-fitting' in this sense. In the same way, the regressions for groups 2 and 3 are also best-fitting lines for the scores of their group. 

A further point is that a sum of squares of the deviations from the regression line (the sum which is a minimum for the particular line) is given for each group by $\sum y^2 - (\sum xy)^2 / \sum x^2$. If the regression is based on n scores, this sum will have (n−2) degrees of freedom, not (n−1), since a further degree of freedom is lost by the fixing of the regression line. This sum of squares is shown for the three groups in the first three lines of table 9.3. 


Table 9.2 Calculations of sums of squares and products for the groups 
in table 9.1 

(i) Group 1 

$\sum X = 34 + 38 + \dots + 56 = 461$;  $\sum X^2 = 34^2 + 38^2 + \dots + 56^2 = 21{,}731$
∴ Corrected sum of squares, $\sum x^2 = 21{,}731 - \frac{461 \times 461}{10} = 478.9$

$\sum Y = 37 + 34 + \dots + 46 = 392$;  $\sum Y^2 = 37^2 + 34^2 + \dots + 46^2 = 15{,}466$
∴ Corrected sum of squares, $\sum y^2 = 15{,}466 - \frac{392 \times 392}{10} = 99.6$

$\sum XY = 34 \times 37 + 38 \times 34 + \dots + 56 \times 46 = 18{,}242$
∴ Corrected sum of products, $\sum xy = 18{,}242 - \frac{461 \times 392}{10} = 170.8$

(ii) Group 2 

$\sum X = 39 + 41 + \dots + 65 = 512$;  $\sum X^2 = 39^2 + 41^2 + \dots + 65^2 = 26{,}868$
∴ Corrected sum of squares, $\sum x^2 = 26{,}868 - \frac{512 \times 512}{10} = 653.6$

$\sum Y = 37 + 42 + \dots + 50 = 443$;  $\sum Y^2 = 37^2 + 42^2 + \dots + 50^2 = 19{,}841$
∴ Corrected sum of squares, $\sum y^2 = 19{,}841 - \frac{443 \times 443}{10} = 216.1$

$\sum XY = 39 \times 37 + 41 \times 42 + \dots + 65 \times 50 = 23{,}004$
∴ Corrected sum of products, $\sum xy = 23{,}004 - \frac{512 \times 443}{10} = 322.4$

(iii) Group 3 

$\sum X = 42 + 46 + \dots + 64 = 530$;  $\sum X^2 = 42^2 + 46^2 + \dots + 64^2 = 28{,}510$
∴ Corrected sum of squares, $\sum x^2 = 28{,}510 - \frac{530 \times 530}{10} = 420.0$

$\sum Y = 41 + 39 + \dots + 48 = 452$;  $\sum Y^2 = 41^2 + 39^2 + \dots + 48^2 = 20{,}564$
∴ Corrected sum of squares, $\sum y^2 = 20{,}564 - \frac{452 \times 452}{10} = 133.6$

$\sum XY = 42 \times 41 + 46 \times 39 + \dots + 64 \times 48 = 24{,}131$
∴ Corrected sum of products, $\sum xy = 24{,}131 - \frac{530 \times 452}{10} = 175.0$

Note to table 9.2 Just as the corrected sum of squares for X is given by $\sum x^2 = \sum X^2 - \frac{(\sum X)^2}{n}$ and that for Y is given by $\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$, so the corrected sum of products is given by a formula of the same mathematical 'shape', i.e. $\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n}$, where n is the number of scores in the group. 


Figure 8 The regression of Y score on X for group 1 in table 9.1 
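The corrected sums of table 9.2, the three regressions and the sums of squares of deviations from regression can all be reproduced from the raw scores. What follows is a sketch under stated assumptions rather than the author's worked method: the function names are hypothetical, and three scores in the data below (the X scores 46 and 51 in group 1 and the Y score 47 in group 3) are inferred from the printed sums of scores, squares and products.

```python
# (X, Y) scores of table 9.1; three entries are inferred from the printed sums
# (assumption): X = 46 and X = 51 in group 1, and Y = 47 in group 3.
GROUPS = {
    1: ([34, 38, 40, 43, 46, 48, 50, 51, 55, 56],
        [37, 34, 37, 39, 37, 41, 42, 39, 40, 46]),
    2: ([39, 41, 44, 48, 50, 52, 54, 59, 60, 65],
        [37, 42, 40, 40, 46, 43, 47, 45, 53, 50]),
    3: ([42, 46, 48, 50, 52, 53, 56, 59, 60, 64],
        [41, 39, 47, 42, 46, 44, 48, 45, 52, 48]),
}


def corrected_sums(xs, ys):
    """Return (sum x^2, sum xy, sum y^2) for deviations from the group means."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxx, sxy, syy


def regression_summary(xs, ys):
    """Slope of Y on X, sum of squared deviations from the line, and its df."""
    sxx, sxy, syy = corrected_sums(xs, ys)
    return sxy / sxx, syy - sxy ** 2 / sxx, len(xs) - 2


summaries = {g: regression_summary(xs, ys) for g, (xs, ys) in GROUPS.items()}
for g, (slope, residual_ss, df) in summaries.items():
    print(f"group {g}: regression {slope:.3f}, residual SS {residual_ss:.2f}, df {df}")
```

Run on these scores, the sketch gives regressions of about 0.357, 0.493 and 0.417, and sums of squared deviations of about 38.68, 57.07 and 60.68 on 8 degrees of freedom each.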


Table 9.3 Analysis of covariance of the data in table 9.1 (variation within groups) 

                               Deviations from mean                                   Regression          Deviations from regression
Source of variation            Degrees of freedom  $\sum x^2$  $\sum xy$  $\sum y^2$  $\sum xy/\sum x^2$  Degrees of freedom  $\sum y^2 - (\sum xy)^2/\sum x^2$  Mean square
Within group 1                         9             478.9       170.8       99.6         0.357                  8                   38.68                  4.84
Within group 2                         9             653.6       322.4      216.1         0.493                  8                   57.07                  7.13
Within group 3                         9             420.0       175.0      133.6         0.417                  8                   60.68                  7.59
Sum for separate regressions                                                                                    24                  156.43                  6.52
Within groups (1+2+3)                 27           1,552.5       668.2      449.3         0.430                 26                  161.71                  6.22


Figure 9 The within-groups regression of Y score on X, and the mean X and Y scores of the groups (data from table 9.1). The overall Y mean is shown by the horizontal line. Each adjusted group mean is given by the overall mean plus the deviation from the regression line ($d_1$, $d_2$ or $d_3$). 


The question of crucial importance is whether the differences among the separate group regressions are statistically significant. If not, the regressions could be averaged, the average regression being the best estimate of the one population regression. This average regression would be found by adding together the sums of squares, $\sum x^2$, and the sums of products, $\sum xy$, for the separate groups (these are also shown in the first three lines of table 9.3) and calculating the quotient $\sum xy / \sum x^2$ for these sums. This is shown in line 5 of table 9.3. We see that the average regression comes to 0.430. The sum of squares of the deviations from this average regression has also been obtained, in precisely the same way as the corresponding sums for the separate regressions. This sum, 161.71, has been obtained from all the scores, and can be compared with 156.43 (line 4 of the table), the sum obtained by adding together the sums for the separate regressions. 

The former sum is necessarily larger than the latter, since the average regression does not provide the best-fitting line for any of the groups. The difference between the two sums, however, gives a measure of the variation among the group regressions. The difference, which works out as 5.28, has 26 − 24 = 2 degrees of freedom (necessarily so, since there are three separate regressions), and the mean square is tested against the mean square from the sum of the separate regressions, 6.52. We see that $F = \frac{2.64}{6.52} < 1$, so that the differences among the separate group regressions are not significant. It follows that the averaging of the separate regressions is legitimate, and one can proceed to adjust the group means for Y on this basis. 
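The comparison of the separate regressions with the average regression uses only the corrected sums for the three groups, so it can be sketched directly from them. The code below is an illustration under stated assumptions: the names are hypothetical, and the group sums are taken as given in table 9.2. Direct computation gives about 156.4 for the sum over the separate regressions.

```python
# Corrected sums (sum x^2, sum xy, sum y^2) for each group, as in table 9.2.
GROUP_SUMS = {
    1: (478.9, 170.8, 99.6),
    2: (653.6, 322.4, 216.1),
    3: (420.0, 175.0, 133.6),
}
N_PER_GROUP = 10


def residual_ss(sxx, sxy, syy):
    """Sum of squared deviations from a regression line: syy - sxy^2 / sxx."""
    return syy - sxy ** 2 / sxx


# Separate regressions: one line per group, n - 2 degrees of freedom each.
separate_ss = sum(residual_ss(*sums) for sums in GROUP_SUMS.values())
separate_df = len(GROUP_SUMS) * (N_PER_GROUP - 2)

# Average (pooled within-groups) regression: add the sums, then fit one slope.
pooled = tuple(sum(s[k] for s in GROUP_SUMS.values()) for k in range(3))
pooled_slope = pooled[1] / pooled[0]
pooled_ss = residual_ss(*pooled)
pooled_df = len(GROUP_SUMS) * (N_PER_GROUP - 1) - 1

# Variation among the group regressions, tested against the separate-lines error.
difference_ss = pooled_ss - separate_ss
difference_df = pooled_df - separate_df
f_regressions = (difference_ss / difference_df) / (separate_ss / separate_df)

print(f"average regression {pooled_slope:.3f}")
print(f"F for differences among regressions = {f_regressions:.2f}")
```

A ratio below 1 can never reach significance, so no table look-up is needed for this step.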

Figure 9 shows the X and Y group means in relation to the average regression of Y scores on X, 0.430. The precise extent of the adjustments to the Y means necessary because of the differing X means is now apparent. While differences among the actual Y means are shown by the heights of the points (or their deviations from the horizontal line drawn through the overall mean), adjusting to take account of the X differences brings in the regression line as a new basis for comparison. It is the vertical deviations of the points from the regression line (not the horizontal line) which measure differences among the Y means adjusted for the differences in X. We see, for instance, that the points for groups 2 and 3 are about the same distance above the regression line, so the adjusted means for these groups will differ only slightly. The point for group 1, on the other hand, is below the regression line, so that its adjusted mean will be the lowest of the three. (Even so, the differences from groups 2 and 3 will be considerably less than for the unadjusted means.) The amount of adjustment will be determined by the product of (a) the deviation of the X group mean from the overall X mean, and (b) the regression of Y on X. Table 9.4 records these products for each of the groups, together with the adjusted means. 


Table 9.4 Calculation of the adjusted group means (data from table 9.1) 

                   Deviation from         Deviation ×
Group   X mean     overall mean (50.1)    regression (0.430)    Y mean    Adjusted Y mean
1        46.1           −4.0                   −1.72             39.2     39.2 − (−1.72) = 40.92
2        51.2            1.1                    0.47             44.3     44.3 − 0.47 = 43.83
3        53.0            2.9                    1.25             45.2     45.2 − 1.25 = 43.95
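The adjustments of table 9.4 amount to one line of arithmetic per group. A minimal sketch follows, assuming the group means, the overall X mean (50.1) and the average regression (0.430) quoted in the text; the function name `adjusted_mean` is hypothetical.

```python
OVERALL_X_MEAN = 50.1
AVERAGE_REGRESSION = 0.430

# (X mean, Y mean) for each group, from table 9.1.
GROUP_MEANS = {1: (46.1, 39.2), 2: (51.2, 44.3), 3: (53.0, 45.2)}


def adjusted_mean(x_mean, y_mean, overall_x_mean, regression):
    """Y mean adjusted to what it would be at the overall X mean."""
    return y_mean - regression * (x_mean - overall_x_mean)


adjusted = {
    group: adjusted_mean(x_mean, y_mean, OVERALL_X_MEAN, AVERAGE_REGRESSION)
    for group, (x_mean, y_mean) in GROUP_MEANS.items()
}
for group, value in adjusted.items():
    print(f"group {group}: adjusted Y mean {value:.2f}")
```

Because group 1 lies below the overall X mean, its deviation is negative and the adjustment raises its Y mean; the other two groups are lowered.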


There remains the question of whether the adjusted means differ significantly. To answer this, it is the deviations from the regression that must be considered, the deviations between groups and within groups. The sum of squares of the deviations within groups has already been obtained (line 5 of table 9.3). The sum for between groups is obtained by subtracting this from the sum of squares of the deviations for the total variation, i.e. the sum for all the scores irrespective of their separation into groups. We begin, then, by calculating the sum of squares of the deviations from the mean for the total variation. This follows the same pattern as that for the separate groups (table 9.2), and is given in table 9.5. The sum of squares of the deviations from the regression then follows from $\sum y^2 - (\sum xy)^2 / \sum x^2$ as before. It is shown on the first line of table 9.6. 

The second line of table 9.6 is the within-groups variation previously obtained in table 9.3. The sum of squares of the deviations from regression in this line is subtracted from the sum in the first line to give the sum of squares for between groups, 50.84. The degrees of freedom are also subtracted, giving 28 − 26 = 2. The mean square for between groups is therefore 25.42. This is tested for significance against the mean square for within groups, giving $F = \frac{25.42}{6.22} = 4.09$, which for 2 and 26 degrees of freedom is significant at the 5-per-cent level (statistical table 2A). We may conclude that there are real group differences in performance in the English test, even after allowance has been made for the group differences in non-verbal intelligence. 

The reader will observe that the between-groups sum of squares has been obtained by subtraction, i.e. total sum minus within-groups sum, which is a reversal of the usual procedure. He may wonder, too, why a between-groups sum of squares could not have been calculated directly from the between-groups sums $\sum x^2$, $\sum xy$ and $\sum y^2$, the sums of squares and products of the deviations from the mean. (These, the missing entries of the third line of table 9.6, could easily have been obtained from the group sums $\sum X$ and $\sum Y$ in the usual way. Again, the within-groups sums in line 2 could then have been verified by subtraction from line 1.) The between-groups regression, however, which is what we would then be using, is directly bound up with the differences among the Y means (differences which are being adjusted and then tested for significance), and so the method of adjustment would not be independent of the differences tested. 
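The test of the adjusted means can be retraced from the two lines of corrected sums (total and within groups). The following is a sketch only: the names are hypothetical, and the critical value 3.37 is the 5-per-cent point of F for 2 and 26 degrees of freedom from standard tables, assumed to agree with statistical table 2A.

```python
# Corrected sums (sum x^2, sum xy, sum y^2) and degrees of freedom about the mean.
TOTAL = {"sums": (1808.7, 898.3, 658.7), "df": 29}
WITHIN = {"sums": (1552.5, 668.2, 449.3), "df": 27}
CRITICAL_F_2_26 = 3.37  # 5-per-cent point of F(2, 26) from standard tables


def residual_ss(sxx, sxy, syy):
    """Sum of squared deviations from regression: syy - sxy^2 / sxx."""
    return syy - sxy ** 2 / sxx


# One further degree of freedom is lost in each line by fitting the regression.
total_ss, total_df = residual_ss(*TOTAL["sums"]), TOTAL["df"] - 1
within_ss, within_df = residual_ss(*WITHIN["sums"]), WITHIN["df"] - 1

# Between groups is obtained by subtraction, as described in the text.
between_ss = total_ss - within_ss
between_df = total_df - within_df
f_adjusted = (between_ss / between_df) / (within_ss / within_df)

print(f"between groups: SS {between_ss:.2f} on {between_df} df")
print(f"F = {f_adjusted:.2f} (critical value {CRITICAL_F_2_26})")
```

The between-groups line is deliberately not fitted with its own regression, for the reason given in the preceding paragraph.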


Table 9.5 Calculation of sums of squares and products for the total 
variation in table 9.1 

$\sum X = 34 + 38 + \dots + 64 = 1503$;  $\sum X^2 = 34^2 + 38^2 + \dots + 64^2 = 77{,}109$
Corrected sum of squares, $\sum x^2 = 77{,}109 - \frac{1503 \times 1503}{30} = 1808.7$

$\sum Y = 37 + 34 + \dots + 48 = 1287$;  $\sum Y^2 = 37^2 + 34^2 + \dots + 48^2 = 55{,}871$
Corrected sum of squares, $\sum y^2 = 55{,}871 - \frac{1287 \times 1287}{30} = 658.7$

$\sum XY = 34 \times 37 + 38 \times 34 + \dots + 64 \times 48 = 65{,}377$
Corrected sum of products, $\sum xy = 65{,}377 - \frac{1503 \times 1287}{30} = 898.3$


Finally, we should be aware that adjusting group means in the manner 
described could lead to the differences among them being increased. 
Suppose that in our present illustration a test of comprehension of Welsh 
had also been given. (We will assume that groups 2 and 3, though hearing 
relatively little or no Welsh at home, had been learning the language as a 
school subject.) Group 1, despite its lower intelligence, would almost 
certainly have the highest attainment in Welsh. If, therefore, the within- 
groups regression still showed an increase in attainment with intelligence, 
adjusting for the differences in intelligence would necessarily increase the 
difference between group 1 and the other groups. The same situation would 
occur in a methods experiment, if the group with the lowest intelligence 
had nevertheless secured the highest attainment mean. The full extent of 
the superiority of the particular method is then brought out only by the 
adjusted means. 


9.3 The model 


The model for the experiment described in the last section may be written 
as 


yug = MA t Bx tei; 


where y;; is the score of person j in group i in the test of English; 
M is a component common to all the English scores; 
A; is a component common to all the English scores of group i; 
B is the regression of English score on intelligence, common to all 
the groups; 
xı; is the score of person j in group i in the test of intelligence, this 
Score being expressed as a deviation from the overall mean 
score, i.e. the mean intelligence score of all the groups; 
and ej; is a component specific to person j of group i. 

In the particular experiment i takes on the values 1, 2 and 3, and j the
values from 1 to 10. Comparing the model with that for randomized
groups (section 3.3), we see that, apart from the attainment score now
being denoted by y, the only difference is the introduction of the
regression term $Bx_{ij}$. The four contributions to the score $y_{ij}$ are all
independent of each other, and as before $e_{ij}$ (the element of randomness
in the design) is such that for any given i it can be regarded as drawn
from a normally distributed population with a mean of zero and a variance
of $\sigma^2$, which is the same for all the populations, i.e. all values of i in turn.*
In the same way, $A_i$ is regarded as being drawn from a normally distributed
population with a mean of zero and a variance of $\sigma_A^2$. The components
analysis then follows the pattern of that for a randomized-groups design
(table 3.5), the only difference being 1 degree of freedom less for both the
individuals (or within groups) and total sources of variance.


Table 9.6 Analysis of covariance of the data in table 9.1 (variation between and within groups)

                 Deviations from mean                                 Deviations from regression
Source           Degrees                                Regression    Degrees
of variation     of freedom   Σx²       Σxy     Σy²     Σxy/Σx²       of freedom   Σy² − (Σxy)²/Σx²   Mean square
Total            29           1,808.7   898.3   658.7                 28           212.55
Within groups    27           1,552.5   668.2   449.3   0.430         26           161.71             6.22
Between groups   2                                                    2            50.84              25.42
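
The entries of table 9.6 can be checked from the printed sums of squares and products. The sketch below is offered only as an arithmetic check, not as the author's computing routine; the helper name deviations_from_regression is an assumption. Because the printed sums are themselves rounded, the recomputed figures may differ from the table in the last decimal place.

```python
def deviations_from_regression(sum_x2, sum_xy, sum_y2):
    """Sum of squares of Y about its regression line on X."""
    return sum_y2 - sum_xy ** 2 / sum_x2


# Sums of squares and products as printed in table 9.6.
total = {"df": 29, "x2": 1808.7, "xy": 898.3, "y2": 658.7}
within = {"df": 27, "x2": 1552.5, "xy": 668.2, "y2": 449.3}

b_within = within["xy"] / within["x2"]   # within-groups regression coefficient
ss_total = deviations_from_regression(total["x2"], total["xy"], total["y2"])
ss_within = deviations_from_regression(within["x2"], within["xy"], within["y2"])
ss_between = ss_total - ss_within        # adjusted between-groups sum of squares

df_within = within["df"] - 1             # one degree of freedom lost to the regression
df_between = 2                           # three groups
ms_within = ss_within / df_within
ms_between = ss_between / df_between
f_ratio = ms_between / ms_within         # referred to F with (2, 26) d.f.

print(f"within-groups regression coefficient: {b_within:.3f}")
print(f"deviations from regression: total {ss_total:.2f}, "
      f"within {ss_within:.2f}, between {ss_between:.2f}")
print(f"mean squares: between {ms_between:.2f}, within {ms_within:.2f}")
print(f"F ratio on ({df_between}, {df_within}) d.f.: {f_ratio:.2f}")
```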

It should be added, too, that, as with the randomized-groups design, 
there is no necessity for equal numbers of individuals in each group. The 
model need not assume equal numbers in the groups, and the methods of 
calculation can easily take account of groups of differing size. 
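
To illustrate that nothing in the calculation depends on equal group sizes, here is a minimal sketch working from raw scores. The data are hypothetical (groups of 4, 5 and 6 persons), and the function names sums and covariance_analysis are assumptions introduced for the example; the arithmetic follows the layout of table 9.6.

```python
def sums(xs, ys):
    """Return (sum x^2, sum xy, sum y^2), taken as deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy


def covariance_analysis(groups):
    """Analysis of covariance for groups of any (possibly unequal) size.

    groups: list of (x_scores, y_scores) pairs, one pair per group.
    """
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    n, k = len(all_x), len(groups)

    t_xx, t_xy, t_yy = sums(all_x, all_y)        # total
    w_xx = w_xy = w_yy = 0.0                      # within groups (pooled)
    for xs, ys in groups:
        sxx, sxy, syy = sums(xs, ys)
        w_xx, w_xy, w_yy = w_xx + sxx, w_xy + sxy, w_yy + syy

    ss_total = t_yy - t_xy ** 2 / t_xx            # deviations from regression
    ss_within = w_yy - w_xy ** 2 / w_xx
    ss_between = ss_total - ss_within
    df_between, df_within = k - 1, n - k - 1
    return {
        "b": w_xy / w_xx,
        "df": (df_between, df_within),
        "F": (ss_between / df_between) / (ss_within / df_within),
    }


# Hypothetical scores: three groups of unequal size.
groups = [
    ([8, 10, 12, 14], [5, 6, 8, 9]),
    ([9, 11, 12, 13, 15], [7, 9, 9, 11, 12]),
    ([7, 9, 10, 12, 13, 15], [8, 9, 11, 11, 13, 14]),
]
result = covariance_analysis(groups)
print(result)
```

The group sizes enter only through the sums and the degrees of freedom (here 2 and 11), so no special provision is needed for unequal numbers.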


9.4 Basic assumptions 


The analysis of covariance exerts a strong attraction for investigators who 
find it difficult, or inconvenient, to control an extraneous factor experi- 
mentally, so it is well to list the assumptions on which the analysis depends. 
As in the case of other applications of the analysis of variance, these
assumptions include normality of distribution and homogeneity of
variance (sections 2.3 and 2.6). They also include the assumption
mentioned earlier in this chapter, namely that the separate within-groups
regressions should differ only by chance. It has also been assumed,
though this has not hitherto been stated, that these regressions are linear,
i.e. the deviations of the group means from a straight-line fit are themselves
due to chance, so that in particular the means do not follow (or deviate from) a
curved regression line.† The basic assumptions may then be listed as
follows:

1. The attainment scores Y in each group must be regarded as a 
random sample from a population of possible scores. The regression 
of Y on X—the measure forming the basis of the adjustment—is 
then the same for all these populations. (If this assumption happens 
to be false, no basis for adjustment exists.) 

2. The regression of Y on X, common to all the populations, is linear. 


* This assumption was not tested in section 9.2, but the mean squares from
the separate groups (recorded in table 9.3) are all estimates of $\sigma^2$, and should
therefore differ only by chance. Bartlett’s test (section 3.6) would confirm that
the differences are not significant at the 5-per-cent level.


† See McNemar (1962) for a description of a test for the assumption of
linearity of regression.
3. The adjusted scores in each of the populations are normally
distributed (the assumption of normality of distribution).

4. The adjusted scores in each of the populations have the same 
variance (the assumption of homogeneity of variance). 
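
Assumption 1 can be examined by comparing the fit of separate within-group regression lines with the fit of a single common slope. The sketch below uses one common form of this test and is not necessarily the calculation followed in section 9.2; the data are hypothetical and the function names sums and slope_homogeneity are assumptions. The resulting F ratio would be compared with the tabulated value for (k − 1, n − 2k) degrees of freedom, where k is the number of groups and n the total number of persons.

```python
def sums(xs, ys):
    """Return (sum x^2, sum xy, sum y^2), taken as deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy


def slope_homogeneity(groups):
    """F ratio for the hypothesis that all within-group slopes are equal."""
    n = sum(len(xs) for xs, _ in groups)
    k = len(groups)
    w_xx = w_xy = w_yy = 0.0
    ss_separate = 0.0                      # residual SS, one slope per group
    slopes = []
    for xs, ys in groups:
        sxx, sxy, syy = sums(xs, ys)
        slopes.append(sxy / sxx)
        ss_separate += syy - sxy ** 2 / sxx
        w_xx, w_xy, w_yy = w_xx + sxx, w_xy + sxy, w_yy + syy

    ss_common = w_yy - w_xy ** 2 / w_xx    # residual SS, single common slope
    ss_slopes = ss_common - ss_separate    # variation among the slopes
    df = (k - 1, n - 2 * k)
    f = (ss_slopes / df[0]) / (ss_separate / df[1])
    return slopes, f, df


# Hypothetical scores (x = intelligence, y = attainment) for three groups.
groups = [
    ([8, 10, 12, 14], [5, 6, 8, 9]),
    ([9, 11, 12, 13, 15], [7, 9, 9, 11, 12]),
    ([7, 9, 10, 12, 13, 15], [8, 9, 11, 11, 13, 14]),
]
slopes, f, df = slope_homogeneity(groups)
print("within-group slopes:", [round(s, 3) for s in slopes])
print(f"F ratio on {df} d.f.: {f:.2f}")
```

A small F ratio is consistent with slopes that differ only by chance; a large one would suggest that no common basis for adjustment exists.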

In addition, if as a result of obtaining statistically significant differences
among the adjusted means, we wish to attribute these differences to the
differing characteristic of the groups (e.g. to the degree of ‘Welshness’ of
background in the illustration of section 9.2), it is essential that each group
be a random sample from a population having the same characteristics.
(This assumption is somewhat wider in scope than assumption 1 above.)
It is also important that the X measures be unaffected by the group
characteristics. Thus, in the illustration of section 9.2 it would be unwise to use
scores on a verbal intelligence test as the X measures, since these would be
affected by ‘Welshness’ of background. In particular, group 1, who come
from homes where little English is spoken, would have their test scores
artificially depressed, and so an adjustment for intelligence would result
in over-compensation. (In a methods experiment, on the other hand, where
scores on an intelligence test or an attainment test given before the
beginning of the experiment are used as the basis for adjustment, this
requirement would obviously be satisfied.) We may, therefore, add to the
four assumptions stated above the following:

5. The persons tested in each group constitute a random sample of a 
population defined by the characteristics of the group (or, at any 
rate, by the particular characteristics under investigation). 

6. The X measures are unaffected by the group characteristics. 


9.5 Further considerations 


The covariance design has been presented as similar to a randomized- 
groups design, except that scores on an additional factor are secured for all 
persons tested. As we have seen, the design then achieves much the same 
purpose as a randomized-blocks design, the additional factor being 
controlled statistically not experimentally. There is no reason, however, 
why an analysis of covariance could not be used in conjunction with a
randomized-groups or factorial design, or again with a Latin-square
design, provided that scores on an additional factor are secured for all the
persons tested, and that all the basic assumptions as set out in the previous
section are satisfied. The appropriate model would be the same as for the
design when an analysis of covariance is not used, apart from an additional
term similar to $Bx_{ij}$ in the model of section 9.3. Examples of these
designs are given by Federer (1955) and Snedecor (1956).

The principle of controlling an extraneous factor statistically by an 
analysis of covariance could be extended to more than one factor. Adjust- 
ments would then be made by means of the multiple regression of the 
(criterion) attainment scores on the extraneous factors controlled. Thus, in 
comparing the attainments of randomized groups we could control both 
intelligence and attainment on a suitable pre-test. The advantage gained— 
judged by the reduction in the mean square for error—would depend upon 
the magnitude of the multiple correlation compared with that of the single 
correlation between the final attainment and either intelligence or the 
initial attainment, whichever is the higher. Experience in the educational 
field has shown that the multiple correlation is often not very much higher 
than the highest of the two or more correlations between the criterion and 
the separate measures—always provided that the measures already 
selected for possible statistical control are among the most suitable. (There 
would, of course, be no point in controlling a measure that had a near-zero 
correlation with the criterion if other measures were readily available.) The 
advantage gained by simultaneously controlling more than one measure 
would then be only slight.* Another possibility would be that of combining
the two or more extraneous factors into a single composite measure at the
outset (see, for example, Dunnette and Hoggatt 1957).
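
The size of the advantage can be illustrated numerically. The sketch below compares the quantity sqrt(1 − R²), where R is the multiple correlation of the criterion with two controlled measures, with sqrt(1 − r²), where r is the higher of the two single correlations. The three correlations are hypothetical values chosen for illustration, and the formula used for R² with two predictors is the standard one, not a result quoted from this book.

```python
from math import sqrt


def multiple_r_squared(r1, r2, r12):
    """Squared multiple correlation of a criterion on two predictors.

    r1, r2: correlations of the criterion with each predictor;
    r12:    correlation between the two predictors.
    """
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)


# Hypothetical correlations, for illustration only.
r1 = 0.70   # final attainment with initial attainment (pre-test)
r2 = 0.60   # final attainment with intelligence
r12 = 0.65  # pre-test with intelligence

R2 = multiple_r_squared(r1, r2, r12)
single = sqrt(1 - max(r1, r2) ** 2)   # controlling the better single measure
multiple = sqrt(1 - R2)               # controlling both measures together

print(f"R = {sqrt(R2):.3f}")
print(f"sqrt(1 - r^2) = {single:.3f}, sqrt(1 - R^2) = {multiple:.3f}")
print(f"ratio = {multiple / single:.3f}")
```

With correlations of this order the second quantity is only a few per cent smaller than the first, which is the sense in which the advantage of controlling both measures is slight.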

In conclusion, some further comparisons between the covariance and
randomized-blocks designs may be helpful. With randomized blocks the
controlled measure is used only to form the groups; it plays no further 
part in the analysis. The measures, too, must be known at the beginning 
of the experiment. With analysis of covariance, on the other hand, the 
controlled measures form the basis of the entire analysis. They could be 
secured, however, either during the experiment itself (e.g. the length of time 
needed to complete a task) or even afterwards (e.g. scores on an intelligence 
test). 

* If R is the multiple correlation and r the correlation for a single controlled
measure, the extent of the advantage will depend upon how much $\sqrt{1-R^2}$
is less than $\sqrt{1-r^2}$.

With randomized blocks, an interaction between the treatments and
the controlled measure may prove to be of particular importance, more
important possibly than the main effect of the treatments themselves. An
interaction does not reveal itself directly in a covariance design, though the
assumption of a common (population) regression is substantially equivalent
to one of there being no interaction.

This leads us on to noting that the assumptions of the analysis of 
covariance are the more restrictive. Randomized blocks require only 
normality of distribution and homogeneity of variance. In particular, no 
assumption as to the nature of the regression is required. We may conclude, 
then, that the analysis of covariance should be preferred only when 
practical considerations militate against the use of randomized blocks, and
when confidence is justified that its more restrictive assumptions are
satisfied.